This is a tutorial on some basic non-asymptotic methods and concepts in random matrix 
theory. The reader will learn several tools for the analysis of the extreme singular values 
of random matrices with independent rows or columns. Many of these methods sprang 
from the development of geometric functional analysis since the 1970s. They have 
applications in several fields, most notably in theoretical computer science, statistics and 
signal processing. A few basic applications are covered in this text, particularly for the 
problem of estimating covariance matrices in statistics and for validating probabilistic 
constructions of measurement matrices in compressed sensing. These notes are written 
particularly for graduate students and beginning researchers in different areas, including 
functional analysts, probabilists, theoretical statisticians, electrical engineers, and 
theoretical computer scientists. 

5.1 Introduction 

Asymptotic and non-asymptotic regimes Random matrix theory studies properties 
of $N \times n$ matrices $A$ chosen from some distribution on the set of all matrices. As 
dimensions $N$ and $n$ grow to infinity, one observes that the spectrum of $A$ tends to 
stabilize. This is manifested in several limit laws, which may be regarded as random matrix 
versions of the central limit theorem. Among them are Wigner's semicircle law for the 
eigenvalues of symmetric Gaussian matrices, the circular law for Gaussian matrices, the 
Marchenko-Pastur law for Wishart matrices $W = A^*A$ where $A$ is a Gaussian matrix, 
and the Bai-Yin and Tracy-Widom laws for the extreme eigenvalues of Wishart matrices $W$. 
The books [51, 5, 23, 6] offer a thorough introduction to the classical problems of random 
matrix theory and its fascinating connections. 

The asymptotic regime where the dimensions $N, n \to \infty$ is well suited for the purposes 
of statistical physics, e.g. when random matrices serve as finite-dimensional models of 
infinite-dimensional operators. But in some other areas including statistics, geometric 
functional analysis, and compressed sensing, the limiting regime may not be very useful 
[69]. Suppose, for example, that we ask about the largest singular value $s_{\max}(A)$ (i.e. the 
largest eigenvalue of $(A^*A)^{1/2}$); to be specific assume that $A$ is an $n \times n$ matrix whose 
entries are independent standard normal random variables. The asymptotic random 
matrix theory answers this question as follows: the Bai-Yin law (see Theorem 5.31) 
states that 

$$s_{\max}(A)/2\sqrt{n} \to 1 \quad \text{almost surely}$$

as the dimension $n \to \infty$. Moreover, the limiting distribution of $s_{\max}(A)$ is known to be 
the Tracy-Widom law (see [71, 27]). In contrast to this, a non-asymptotic answer to the 
same question is the following: in every dimension $n$, one has 

$$s_{\max}(A) \le C\sqrt{n} \quad \text{with probability at least } 1 - e^{-n},$$

here $C$ is an absolute constant (see Theorems 5.32 and 5.39). The latter answer is less 
precise (because of an absolute constant $C$) but more quantitative because for fixed 
dimensions $n$ it gives an exponential probability of success.^1 This is the kind of answer 
we will seek in this text: guarantees up to absolute constants in all dimensions, and 
with large probability. 

^1 For this specific model (Gaussian matrices), Theorems 5.32 and 5.35 even give a sharp absolute 
constant $C \approx 2$ here. But the result mentioned here is much more general as we will see later; it only 
requires independence of rows or columns of $A$. 

Tall matrices are approximate isometries The following heuristic will be our 
guideline: tall random matrices should act as approximate isometries. So, an $N \times n$ 
random matrix $A$ with $N \gg n$ should act almost like an isometric embedding of $\ell_2^n$ into 
$\ell_2^N$: 

$$(1 - \delta)K\|x\|_2 \le \|Ax\|_2 \le (1 + \delta)K\|x\|_2 \quad \text{for all } x \in \mathbb{R}^n$$

where $K$ is an appropriate normalization factor and $\delta \ll 1$. Equivalently, this says that 
all the singular values of $A$ are close to each other: 

$$(1 - \delta)K \le s_{\min}(A) \le s_{\max}(A) \le (1 + \delta)K,$$

where $s_{\min}(A)$ and $s_{\max}(A)$ denote the smallest and the largest singular values of $A$. Yet 
equivalently, this means that tall matrices are well conditioned: the condition number of 
$A$ is $\kappa(A) = s_{\max}(A)/s_{\min}(A) \le (1 + \delta)/(1 - \delta) \approx 1$. 

In the asymptotic regime and for random matrices with independent entries, our 
heuristic is justified by the Bai-Yin law, which is Theorem 5.31 below. Loosely speaking, 
it states that as the dimensions $N, n$ increase to infinity while the aspect ratio $N/n$ is 
fixed, we have 

$$\sqrt{N} - \sqrt{n} \approx s_{\min}(A) \le s_{\max}(A) \approx \sqrt{N} + \sqrt{n}. \tag{5.1}$$

In these notes, we study $N \times n$ random matrices $A$ with independent rows or independent 
columns, but not necessarily independent entries. We develop non-asymptotic versions 
of (5.1) for such matrices, which should hold for all dimensions $N$ and $n$. The desired 
results should have the form 

$$\sqrt{N} - C\sqrt{n} \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + C\sqrt{n} \tag{5.2}$$

with large probability, e.g. $1 - e^{-N}$, where $C$ is an absolute constant.^2 For tall matrices, 
where $N \gg n$, both sides of this inequality would be close to each other, which would 
guarantee that $A$ is an approximate isometry. 
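The heuristic can be checked numerically. The following is a minimal sketch, assuming NumPy and a Gaussian matrix with i.i.d. standard normal entries; the sizes `N`, `n` and the helper name `extreme_singular_values` are illustrative choices, not part of the text.

```python
import numpy as np

def extreme_singular_values(N, n, rng):
    """Return (s_min, s_max) of an N x n matrix with i.i.d. N(0, 1) entries."""
    A = rng.standard_normal((N, n))
    s = np.linalg.svd(A, compute_uv=False)  # singular values, largest first
    return s[-1], s[0]

rng = np.random.default_rng(0)
N, n = 2000, 50                              # a tall matrix: N >> n
s_min, s_max = extreme_singular_values(N, n, rng)
lower, upper = np.sqrt(N) - np.sqrt(n), np.sqrt(N) + np.sqrt(n)

print(f"sqrt(N) - sqrt(n) = {lower:.2f}, s_min = {s_min:.2f}")
print(f"s_max = {s_max:.2f}, sqrt(N) + sqrt(n) = {upper:.2f}")
print(f"condition number = {s_max / s_min:.3f}")
```

For a tall matrix the two extreme singular values should land close to the interval $[\sqrt{N} - \sqrt{n}, \sqrt{N} + \sqrt{n}]$, so the condition number stays near 1.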

Models and methods We shall study quite general models of random matrices - those 
with independent rows or independent columns that are sampled from high-dimensional 
distributions. We will place either strong moment assumptions on the distribution 
(sub-gaussian growth of moments), or no moment assumptions at all (except finite variance). 
This leads us to four types of main results: 

1. Matrices with independent sub-gaussian rows: Theorem 5.39 

2. Matrices with independent heavy-tailed rows: Theorem 5.41 

3. Matrices with independent sub-gaussian columns: Theorem 5.58 

4. Matrices with independent heavy-tailed columns: Theorem 5.62 

^2 More accurately, we should expect $C = O(1)$ to depend on easily computable quantities of the 
distribution, such as its moments. This will be clear from the context. 

These four models cover many natural classes of random matrices that occur in 
applications, including random matrices with independent entries (Gaussian and Bernoulli 
in particular) and random sub-matrices of orthogonal matrices (random Fourier matrices 
in particular). 

The analysis of these four models is based on a variety of tools of probability theory 
and geometric functional analysis, most of which have not been covered in the texts on 
the "classical" random matrix theory. The reader will learn basics on sub-gaussian and 
sub-exponential random variables, isotropic random vectors, large deviation inequalities 
for sums of independent random variables, extensions of these inequalities to random 
matrices, and several basic methods of high dimensional probability such as symmetrization, 
decoupling, and covering ($\varepsilon$-net) arguments. 

Applications In these notes we shall emphasize two applications, one in statistics and 
one in compressed sensing. Our analysis of random matrices with independent rows 
immediately applies to a basic problem in statistics - estimating covariance matrices of 
high-dimensional distributions. If a random matrix $A$ has i.i.d. rows $A_i$, then $A^*A = 
\sum_i A_i \otimes A_i$ is the sample covariance matrix. If $A$ has independent columns $A_j$, then 
$A^*A = (\langle A_j, A_k \rangle)_{j,k}$ is the Gram matrix. Thus our analysis of the row-independent and 
column-independent models can be interpreted as a study of sample covariance matrices 
and Gram matrices of high dimensional distributions. We will see in Section 5.4.3 that for 
a general distribution in $\mathbb{R}^n$, its covariance matrix can be estimated from a sample of size 
$N = O(n \log n)$ drawn from the distribution. Moreover, for sub-gaussian distributions 
we have an even better bound $N = O(n)$. For low-dimensional distributions, far fewer 
samples are needed - if a distribution lies close to a subspace of dimension $r$ in $\mathbb{R}^n$, then 
a sample of size $N = O(r \log n)$ is sufficient for covariance estimation. 
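The two readings of $A^*A$ can be illustrated with a short sketch, assuming NumPy and standard normal rows (an isotropic distribution, so the matrix being estimated is the identity). The normalization by $N$ and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 2000, 4
A = rng.standard_normal((N, n))          # rows A_i are i.i.d. samples in R^n

# Row view: A^T A is the sum of the outer products A_i (x) A_i.
sum_outer = sum(np.outer(row, row) for row in A)
sample_cov = A.T @ A / N                 # normalized sample covariance matrix

# Column view: A^T A is the Gram matrix, with entries <A_j, A_k>.
gram = np.array([[A[:, j] @ A[:, k] for k in range(n)] for j in range(n)])

# Spectral-norm distance to the true covariance (the identity here).
err = np.linalg.norm(sample_cov - np.eye(n), 2)
print(f"spectral-norm error of the sample covariance: {err:.3f}")
```

With $N$ much larger than $n$ the error should be small, in line with the sample sizes quoted above.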

In compressed sensing, the best known measurement matrices are random. A sufficient 
condition for a matrix to succeed for the purposes of compressed sensing is given by 
the restricted isometry property. Loosely speaking, this property demands that all 
sub-matrices of given size be well-conditioned. This fits well in the circle of problems of the 
non-asymptotic random matrix theory. Indeed, we will see in Section 5.6 that all basic 
models of random matrices are nice restricted isometries. These include Gaussian and 
Bernoulli matrices, more generally all matrices with sub-gaussian independent entries, 
and even more generally all matrices with sub-gaussian independent rows or columns. 
Also, the class of restricted isometries includes random Fourier matrices, more generally 
random sub-matrices of bounded orthogonal matrices, and even more generally matrices 
whose rows are independent samples from an isotropic distribution with uniformly 
bounded coordinates. 

Related sources This text is a tutorial rather than a survey, so we focus on explaining 
methods rather than results. This forces us to make some concessions in our choice of 
the subjects. Concentration of measure and its applications to random matrix theory 
are only briefly mentioned. For an introduction into concentration of measure suitable 
for a beginner, see [5] and [IHl Chapter 14]; for a thorough exposition see [5SJ 
for connections with random matrices see [21, 44]. The monograph [45] also offers an 
introduction into concentration of measure and related probabilistic methods in analysis 
and geometry, some of which we shall use in these notes. 

We completely avoid the important (but more difficult) model of symmetric random 
matrices with independent entries on and above the diagonal. Starting from the work 
of Füredi and Komlós [29], the largest singular value (the spectral norm) of symmetric 
random matrices has been a subject of study in many works; see e.g. [501 HSl El] and 
the references therein. 

We also do not attempt to discuss sharp small deviation inequalities (of Tracy-Widom 
type) for the extreme eigenvalues. Both these topics and much more are discussed 
in the surveys [21, 44, 69], which serve as bridges between asymptotic and non-asymptotic 
problems in random matrix theory. 

Because of the absolute constant $C$ in (5.2), our analysis of the smallest singular 
value (the "hard edge") will only be useful for sufficiently tall matrices, where $N \ge 
C^2 n$. For square and almost square matrices, the hard edge problem will be only briefly 
mentioned in Section 5.3. The surveys [76, 69] discuss this problem at length, and they 
offer a glimpse of connections to other problems of random matrix theory and additive 
combinatorics. 

Many of the results and methods presented in these notes are known in one form 
or another. Some of them are published while some others belong to the folklore of 
probability in Banach spaces, geometric functional analysis, and related areas. When 
available, historic references are given in Section 5.7. 

Acknowledgements The author is grateful to the colleagues who made a number of 
improving suggestions for the earlier versions of the manuscript, in particular to Richard 
Chen, Subhroshekhar Ghosh, Alexander Litvak, Deanna Needell, Holger Rauhut, S V 
N Vishwanathan and the anonymous referees. Special thanks are due to Ulas Ayaz and 
Felix Krahmer who thoroughly read the entire text, and whose numerous comments led 
to significant improvements of this tutorial. 

5.2 Preliminaries 

5.2.1 Matrices and their singular values 

The main object of our study will be an $N \times n$ matrix $A$ with real or complex entries. 
We shall state all results in the real case; the reader will be able to adjust them to the 
complex case as well. Usually but not always one should think of tall matrices $A$, those 
for which $N \ge n \ge 1$. By passing to the adjoint matrix $A^*$, many results can be carried 
over to "flat" matrices, those for which $N \le n$. 

It is often convenient to study $A$ through the $n \times n$ symmetric positive-semidefinite 
matrix $A^*A$. The eigenvalues of $|A| := \sqrt{A^*A}$ are therefore non-negative real 
numbers. Arranged in non-increasing order, they are called the singular values^3 of 
$A$ and denoted $s_1(A) \ge \cdots \ge s_n(A) \ge 0$. Many applications require estimates on the 
extreme singular values 

$$s_{\max}(A) := s_1(A), \qquad s_{\min}(A) := s_n(A).$$

^3 In the literature, singular values are also called s-numbers. 

The smallest singular value is only of interest for tall matrices, since for $N < n$ one 
automatically has $s_{\min}(A) = 0$. 

Equivalently, $s_{\max}(A)$ and $s_{\min}(A)$ are respectively the smallest number $M$ and the 
largest number $m$ such that 

$$m\|x\|_2 \le \|Ax\|_2 \le M\|x\|_2 \quad \text{for all } x \in \mathbb{R}^n. \tag{5.3}$$

In order to interpret this definition geometrically, we look at $A$ as a linear operator 
from $\mathbb{R}^n$ into $\mathbb{R}^N$. The Euclidean distance between any two points in $\mathbb{R}^n$ can increase 
by at most the factor $s_{\max}(A)$ and decrease by at most the factor $s_{\min}(A)$ under the 
action of $A$. Therefore, the extreme singular values control the distortion of the Euclidean 
geometry under the action of $A$. If $s_{\max}(A) \approx s_{\min}(A) \approx 1$ then $A$ acts as an approximate 
isometry, or more accurately an approximate isometric embedding of $\ell_2^n$ into $\ell_2^N$. 

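The extreme singular values can also be described in terms of the spectral norm of 
$A$, which is by definition 

$$\|A\| = \|A\|_{\ell_2^n \to \ell_2^N} = \sup_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \sup_{x \in S^{n-1}} \|Ax\|_2. \tag{5.4}$$

Inequality (5.3) gives a link between the extreme singular values and the spectral norm: 

$$s_{\max}(A) = \|A\|, \qquad s_{\min}(A) = 1/\|A^\dagger\|$$

where $A^\dagger$ denotes the pseudoinverse of $A$; if $A$ is invertible then $A^\dagger = A^{-1}$. 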
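These identities are easy to confirm numerically. The sketch below assumes NumPy and a random full-rank tall matrix; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 30, 5
A = rng.standard_normal((N, n))

s = np.linalg.svd(A, compute_uv=False)   # singular values, largest first
s_max, s_min = s[0], s[-1]

# s_max(A) is the spectral norm; s_min(A) is 1 / ||pseudoinverse of A||.
norm_A = np.linalg.norm(A, 2)
norm_pinv = np.linalg.norm(np.linalg.pinv(A), 2)

# Inequality (5.3) on random unit vectors: s_min <= ||Ax||_2 <= s_max.
X = rng.standard_normal((1000, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
stretch = np.linalg.norm(X @ A.T, axis=1)

print(s_min, stretch.min(), stretch.max(), s_max)
```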
5.2.2 Nets 

Nets are convenient means to discretize compact sets. In our study we will mostly need 
to discretize the unit Euclidean sphere $S^{n-1}$ in the definition of the spectral norm (5.4). 
Let us first recall a general definition of an $\varepsilon$-net. 

Definition 5.1 (Nets, covering numbers). Let $(X, d)$ be a metric space and let $\varepsilon > 0$. 
A subset $\mathcal{N}_\varepsilon$ of $X$ is called an $\varepsilon$-net of $X$ if every point $x \in X$ can be approximated to 
within $\varepsilon$ by some point $y \in \mathcal{N}_\varepsilon$, i.e. so that $d(x, y) \le \varepsilon$. The minimal cardinality of an 
$\varepsilon$-net of $X$, if finite, is denoted $\mathcal{N}(X, \varepsilon)$ and is called the covering number^4 of $X$ (at 
scale $\varepsilon$). 

From a characterization of compactness we remember that a complete metric space $X$ is 
compact if and only if $\mathcal{N}(X, \varepsilon) < \infty$ for each $\varepsilon > 0$. A quantitative estimate on $\mathcal{N}(X, \varepsilon)$ would 
give us a quantitative version of compactness of $X$.^5 Let us therefore take a simple example of 
a metric space, the unit Euclidean sphere $S^{n-1}$ equipped with the Euclidean metric^6 
$d(x, y) = \|x - y\|_2$, and estimate its covering numbers. 

^4 Equivalently, $\mathcal{N}(X, \varepsilon)$ is the minimal number of balls with radii $\varepsilon$ and with centers in $X$ needed to 
cover $X$. 

^5 In statistical learning theory and geometric functional analysis, $\log \mathcal{N}(X, \varepsilon)$ is called the metric 
entropy of $X$. In some sense it measures the "complexity" of the metric space $X$. 

^6 A similar result holds for the geodesic metric on the sphere, since for small $\varepsilon$ these two distances 
are equivalent. 

Lemma 5.2 (Covering numbers of the sphere). The unit Euclidean sphere $S^{n-1}$ equipped 
with the Euclidean metric satisfies for every $\varepsilon > 0$ that 

$$\mathcal{N}(S^{n-1}, \varepsilon) \le \Big(1 + \frac{2}{\varepsilon}\Big)^n.$$

Proof. This is a simple volume argument. Let us fix $\varepsilon > 0$ and choose $\mathcal{N}_\varepsilon$ to be a maximal 
$\varepsilon$-separated subset of $S^{n-1}$. In other words, $\mathcal{N}_\varepsilon$ is such that $d(x, y) \ge \varepsilon$ for all $x, y \in \mathcal{N}_\varepsilon$, 
$x \ne y$, and no subset of $S^{n-1}$ strictly containing $\mathcal{N}_\varepsilon$ has this property.^7 

The maximality property implies that $\mathcal{N}_\varepsilon$ is an $\varepsilon$-net of $S^{n-1}$. Indeed, otherwise 
there would exist $x \in S^{n-1}$ that is at least $\varepsilon$-far from all points in $\mathcal{N}_\varepsilon$. So $\mathcal{N}_\varepsilon \cup \{x\}$ 
would still be an $\varepsilon$-separated set, contradicting the maximality property. 

Moreover, the separation property implies via the triangle inequality that the balls of 
radii $\varepsilon/2$ centered at the points in $\mathcal{N}_\varepsilon$ are disjoint. On the other hand, all such balls lie in 
$(1 + \varepsilon/2)B_2^n$ where $B_2^n$ denotes the unit Euclidean ball centered at the origin. Comparing 
the volume gives $\mathrm{vol}(\frac{\varepsilon}{2}B_2^n) \cdot |\mathcal{N}_\varepsilon| \le \mathrm{vol}((1 + \frac{\varepsilon}{2})B_2^n)$. Since $\mathrm{vol}(rB_2^n) = r^n\,\mathrm{vol}(B_2^n)$ for 
all $r \ge 0$, we conclude that $|\mathcal{N}_\varepsilon| \le (1 + \frac{\varepsilon}{2})^n / (\frac{\varepsilon}{2})^n = (1 + \frac{2}{\varepsilon})^n$ as required. □ 
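The greedy construction of a separated set can be sketched in a few lines. This is an illustration under stated assumptions: a finite random sample stands in for the sphere $S^{n-1}$, so the result is an $\varepsilon$-net of the sample rather than of the whole sphere, and the function name `greedy_separated_subset` is made up for this example. The cardinality bound of Lemma 5.2 still applies, because the volume argument only uses the separation property.

```python
import numpy as np

def greedy_separated_subset(points, eps):
    """Keep each point that is at distance >= eps from all points kept so far."""
    net = [points[0]]
    for x in points[1:]:
        if np.linalg.norm(np.array(net) - x, axis=1).min() >= eps:
            net.append(x)
    return np.array(net)

rng = np.random.default_rng(3)
n, eps = 3, 0.5
# A finite random sample of unit vectors stands in for the sphere S^{n-1}.
points = rng.standard_normal((5000, n))
points /= np.linalg.norm(points, axis=1, keepdims=True)

net = greedy_separated_subset(points, eps)
# Distance from each sample point to its nearest net point.
dists = np.linalg.norm(points[:, None, :] - net[None, :, :], axis=2).min(axis=1)
# Pairwise distances inside the net (diagonal entries are zero).
pairwise = np.linalg.norm(net[:, None, :] - net[None, :, :], axis=2)

print(len(net), (1 + 2 / eps) ** n, dists.max())
```

Every rejected point lies within `eps` of a kept point, which is exactly the maximality argument in the proof.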

Nets allow us to reduce the complexity of computations with linear operators. One 
such example is the computation of the spectral norm. To evaluate the spectral norm by 
definition (5.4) one needs to take the supremum over the whole sphere $S^{n-1}$. However, 
one can essentially replace the sphere by its $\varepsilon$-net: 

Lemma 5.3 (Computing the spectral norm on a net). Let $A$ be an $N \times n$ matrix, and 
let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net of $S^{n-1}$ for some $\varepsilon \in [0, 1)$. Then 

$$\max_{x \in \mathcal{N}_\varepsilon} \|Ax\|_2 \le \|A\| \le (1 - \varepsilon)^{-1} \max_{x \in \mathcal{N}_\varepsilon} \|Ax\|_2.$$

Proof. The lower bound in the conclusion follows from the definition. To prove the 
upper bound let us fix $x \in S^{n-1}$ for which $\|A\| = \|Ax\|_2$, and choose $y \in \mathcal{N}_\varepsilon$ which 
approximates $x$ as $\|x - y\|_2 \le \varepsilon$. By the triangle inequality we have $\|Ax - Ay\|_2 \le 
\|A\| \|x - y\|_2 \le \varepsilon\|A\|$. It follows that 

$$\|Ay\|_2 \ge \|Ax\|_2 - \|Ax - Ay\|_2 \ge \|A\| - \varepsilon\|A\| = (1 - \varepsilon)\|A\|.$$

Taking maximum over all $y \in \mathcal{N}_\varepsilon$ in this inequality, we complete the proof. □ 

A similar result holds for symmetric $n \times n$ matrices $A$, whose spectral norm can be 
computed via the associated quadratic form: $\|A\| = \sup_{x \in S^{n-1}} |\langle Ax, x \rangle|$. Again, one can 
essentially replace the sphere by its $\varepsilon$-net: 

Lemma 5.4 (Computing the spectral norm on a net). Let $A$ be a symmetric $n \times n$ 
matrix, and let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net of $S^{n-1}$ for some $\varepsilon \in [0, 1/2)$. Then 

$$\|A\| = \sup_{x \in S^{n-1}} |\langle Ax, x \rangle| \le (1 - 2\varepsilon)^{-1} \sup_{x \in \mathcal{N}_\varepsilon} |\langle Ax, x \rangle|.$$

Proof. Let us choose $x \in S^{n-1}$ for which $\|A\| = |\langle Ax, x \rangle|$, and choose $y \in \mathcal{N}_\varepsilon$ which 
approximates $x$ as $\|x - y\|_2 \le \varepsilon$. By the triangle inequality we have 

$$|\langle Ax, x \rangle - \langle Ay, y \rangle| = |\langle Ax, x - y \rangle + \langle A(x - y), y \rangle| \le \|A\| \|x\|_2 \|x - y\|_2 + \|A\| \|x - y\|_2 \|y\|_2 \le 2\varepsilon\|A\|.$$

It follows that $|\langle Ay, y \rangle| \ge |\langle Ax, x \rangle| - 2\varepsilon\|A\| = (1 - 2\varepsilon)\|A\|$. Taking the maximum over 
all $y \in \mathcal{N}_\varepsilon$ in this inequality completes the proof. □ 

^7 One can in fact construct $\mathcal{N}_\varepsilon$ inductively by first selecting an arbitrary point on the sphere, and at 
each next step selecting a point that is at distance at least $\varepsilon$ from those already selected. By compactness, 
this algorithm will terminate after finitely many steps and it will yield a set $\mathcal{N}_\varepsilon$ as we required. 
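Both net lemmas can be tested in dimension $n = 2$, where an explicit net is available. The sketch below assumes NumPy and uses equally spaced points on the circle $S^1$; the helper `circle_net` and the choice $\varepsilon = 0.25$ are illustrative assumptions.

```python
import numpy as np

def circle_net(eps):
    """Equally spaced points on S^1 forming an eps-net in the Euclidean metric."""
    # With k equally spaced points, every point of the circle is within angle
    # pi/k of the net, hence within chord length 2*sin(pi/(2k)) <= pi/k <= eps.
    k = int(np.ceil(np.pi / eps))
    angles = 2 * np.pi * np.arange(k) / k
    return np.column_stack([np.cos(angles), np.sin(angles)])

rng = np.random.default_rng(4)
eps = 0.25
net = circle_net(eps)

# Norm of a general N x n matrix (n = 2) computed on the net.
A = rng.standard_normal((6, 2))
net_max = np.linalg.norm(net @ A.T, axis=1).max()
norm_A = np.linalg.norm(A, 2)

# Norm of a symmetric 2 x 2 matrix computed via the quadratic form on the net.
B = rng.standard_normal((2, 2))
S = (B + B.T) / 2
quad_max = np.abs(np.einsum("ij,jk,ik->i", net, S, net)).max()
norm_S = np.linalg.norm(S, 2)

print(net_max, norm_A, net_max / (1 - eps))
print(quad_max, norm_S, quad_max / (1 - 2 * eps))
```

In each printed line the middle value (the true norm) should sit between the net maximum and its rescaled version.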

5.2.3 Sub-gaussian random variables 

In this section we introduce the class of sub-gaussian random variables,^8 those whose 
distributions are dominated by the distribution of a centered gaussian random variable. 
This is a convenient and quite wide class, which contains in particular the standard 
normal and all bounded random variables. 

Let us briefly recall some of the well known properties of the standard normal random 
variable $X$. The distribution of $X$ has density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and is denoted $N(0, 1)$. 
Estimating the integral of this density between $t$ and $\infty$ one checks that the tail of a 
standard normal random variable $X$ decays super-exponentially: 

$$\mathbb{P}\{|X| > t\} = \frac{2}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\,dx \le 2e^{-t^2/2}, \quad t \ge 1, \tag{5.5}$$

see e.g. [26, Theorem 1.4] for a more precise two-sided inequality. The absolute moments 
of $X$ can be computed as 

$$(\mathbb{E}|X|^p)^{1/p} = \sqrt{2} \left[ \frac{\Gamma((1 + p)/2)}{\Gamma(1/2)} \right]^{1/p} = O(\sqrt{p}), \quad p \ge 1. \tag{5.6}$$

The moment generating function of $X$ equals 

$$\mathbb{E}\exp(tX) = e^{t^2/2}, \quad t \in \mathbb{R}. \tag{5.7}$$
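The tail bound and the moment growth of the standard normal variable can be checked with the Python standard library alone. This sketch relies on the standard closed form $\mathbb{E}|X|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi}$; the function names and the range of exponents are illustrative.

```python
import math

def gaussian_tail(t):
    """P{|X| > t} for a standard normal X, via the complementary error function."""
    return math.erfc(t / math.sqrt(2))

def gaussian_abs_moment_norm(p):
    """(E|X|^p)^(1/p) for a standard normal X, from the Gamma-function formula."""
    return math.sqrt(2) * (math.gamma((1 + p) / 2) / math.gamma(0.5)) ** (1 / p)

# Tail decay: compare the exact tail with 2 * exp(-t^2 / 2).
for t in [1, 2, 3, 4]:
    print(t, gaussian_tail(t), 2 * math.exp(-t**2 / 2))

# Moment growth: the ratio (E|X|^p)^(1/p) / sqrt(p) stays bounded.
ratios = [gaussian_abs_moment_norm(p) / math.sqrt(p) for p in range(1, 41)]
print(min(ratios), max(ratios))
```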

Now let $X$ be a general random variable. We observe that these three properties are 
equivalent: a super-exponential tail decay like in (5.5), the moment growth (5.6), and 
the growth of the moment generating function like in (5.7). We will then focus on the 
class of random variables that satisfy these properties, which we shall call sub-gaussian 
random variables. 

Lemma 5.5 (Equivalence of sub-gaussian properties). Let $X$ be a random variable. 
Then the following properties are equivalent with parameters $K_i > 0$ differing from each 
other by at most an absolute constant factor.^9 

1. Tails: $\mathbb{P}\{|X| > t\} \le \exp(1 - t^2/K_1^2)$ for all $t \ge 0$; 

2. Moments: $(\mathbb{E}|X|^p)^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$; 

3. Super-exponential moment: $\mathbb{E}\exp(X^2/K_3^2) \le e$. 

Moreover, if $\mathbb{E}X = 0$ then properties 1-3 are also equivalent to the following one: 

4. Moment generating function: $\mathbb{E}\exp(tX) \le \exp(t^2 K_4^2)$ for all $t \in \mathbb{R}$. 

^8 It would be more rigorous to say that we study sub-gaussian probability distributions. The same 
concerns some other properties of random variables and random vectors we study later in this text. 
However, it is convenient for us to focus on random variables and vectors because we will form random 
matrices out of them. 

^9 The precise meaning of this equivalence is the following. There exists an absolute constant $C$ such 
that property $i$ implies property $j$ with parameter $K_j \le CK_i$ for any two properties $i, j = 1, 2, 3$. 

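Proof. 1 ⇒ 2. Assume property 1 holds. By homogeneity, rescaling $X$ to $X/K_1$ we can 
assume that $K_1 = 1$. Recall that for every non-negative random variable $Z$, integration 
by parts yields the identity $\mathbb{E}Z = \int_0^\infty \mathbb{P}\{Z \ge u\}\,du$. We apply this for $Z = |X|^p$. After 
change of variables $u = t^p$, we obtain using property 1 that 

$$\mathbb{E}|X|^p = \int_0^\infty \mathbb{P}\{|X| \ge t\}\,pt^{p-1}\,dt \le \int_0^\infty e^{1-t^2} pt^{p-1}\,dt = \Big(\frac{ep}{2}\Big)\Gamma\Big(\frac{p}{2}\Big) \le \Big(\frac{ep}{2}\Big)\Big(\frac{p}{2}\Big)^{p/2}.$$

(The last inequality uses $\Gamma(x) \le x^x$, which is valid for $x \ge 1$, that is for $p \ge 2$; for 
$1 \le p < 2$ one has $\Gamma(p/2) \le \Gamma(1/2) = \sqrt{\pi}$, which affects only the constant.) 
Taking the $p$-th root yields property 2 with a suitable absolute constant $K_2$. 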

2 ⇒ 3. Assume property 2 holds. As before, by homogeneity we may assume that 
$K_2 = 1$. Let $c > 0$ be a sufficiently small absolute constant. Writing the Taylor series of 
the exponential function, we obtain 

$$\mathbb{E}\exp(cX^2) = 1 + \sum_{p=1}^\infty \frac{c^p\,\mathbb{E}(X^{2p})}{p!} \le 1 + \sum_{p=1}^\infty \frac{c^p (2p)^p}{p!} \le 1 + \sum_{p=1}^\infty (2ce)^p.$$

The first inequality follows from property 2; in the second one we use $p! \ge (p/e)^p$. For 
small $c$ this gives $\mathbb{E}\exp(cX^2) \le e$, which is property 3 with $K_3 = c^{-1/2}$. 

3 ⇒ 1. Assume property 3 holds. As before we may assume that $K_3 = 1$. 
Exponentiating and using Markov's inequality^10 and then property 3, we have 

$$\mathbb{P}\{|X| > t\} = \mathbb{P}\{e^{X^2} \ge e^{t^2}\} \le e^{-t^2}\,\mathbb{E}e^{X^2} \le e^{1-t^2}.$$

This proves property 1 with $K_1 = 1$. 

2 ⇒ 4. Let us now assume that $\mathbb{E}X = 0$ and property 2 holds; as usual we 
can assume that $K_2 = 1$. We will prove that property 4 holds with an appropriately 
large absolute constant $C = K_4$. This will follow by estimating Taylor series for the 
exponential function 

$$\mathbb{E}\exp(tX) = 1 + t\,\mathbb{E}X + \sum_{p=2}^\infty \frac{t^p\,\mathbb{E}X^p}{p!} \le 1 + \sum_{p=2}^\infty \frac{|t|^p p^{p/2}}{p!} \le 1 + \sum_{p=2}^\infty \Big(\frac{e|t|}{\sqrt{p}}\Big)^p. \tag{5.8}$$

The first inequality here follows from $\mathbb{E}X = 0$ and property 2; the second one holds since 
$p! \ge (p/e)^p$. We compare this with Taylor's series for 

$$\exp(C^2t^2) = 1 + \sum_{k=1}^\infty \frac{(C|t|)^{2k}}{k!} \ge 1 + \sum_{k=1}^\infty \Big(\frac{C|t|}{\sqrt{k}}\Big)^{2k} = 1 + \sum_{p \in 2\mathbb{N}} \Big(\frac{C|t|}{\sqrt{p/2}}\Big)^p. \tag{5.9}$$

The inequality here holds because $p! \le p^p$; the final equality is obtained by the substitution 
$p = 2k$. One can show that the series in (5.8) is bounded by the series in (5.9) with large 
absolute constant $C$. We conclude that $\mathbb{E}\exp(tX) \le \exp(C^2t^2)$, which proves property 4. 

^10 This simple argument is sometimes called exponential Markov's inequality. 

4 ⇒ 1. Assume property 4 holds; we can also assume that $K_4 = 1$. Let $\lambda > 0$ be a 
parameter to be chosen later. By exponential Markov inequality, and using the bound 
on the moment generating function given in property 4, we obtain 

$$\mathbb{P}\{X \ge t\} = \mathbb{P}\{e^{\lambda X} \ge e^{\lambda t}\} \le e^{-\lambda t}\,\mathbb{E}e^{\lambda X} \le e^{-\lambda t + \lambda^2}.$$

Optimizing in $\lambda$ and thus choosing $\lambda = t/2$ we conclude that $\mathbb{P}\{X \ge t\} \le e^{-t^2/4}$. 
Repeating this argument for $-X$, we also obtain $\mathbb{P}\{X \le -t\} \le e^{-t^2/4}$. Combining these 
two bounds we conclude that $\mathbb{P}\{|X| \ge t\} \le 2e^{-t^2/4} \le e^{1-t^2/4}$. Thus property 1 holds 
with $K_1 = 2$. The lemma is proved. □ 

Remark 5.6. 1. The constants 1 and e in properties 1 and 3 respectively are chosen 
for convenience. Thus the value 1 can be replaced by any positive number and the 
value e can be replaced by any number greater than 1. 

2. The assumption $\mathbb{E}X = 0$ is only needed to prove the necessity of property 4; the 
sufficiency holds without this assumption. 

Definition 5.7 (Sub-gaussian random variables). A random variable $X$ that satisfies 
one of the equivalent properties 1-3 in Lemma 5.5 is called a sub-gaussian random 
variable. The sub-gaussian norm of $X$, denoted $\|X\|_{\psi_2}$, is defined to be the smallest $K_2$ 
in property 2. In other words,^11 

$$\|X\|_{\psi_2} = \sup_{p \ge 1} p^{-1/2} (\mathbb{E}|X|^p)^{1/p}.$$

The class of sub-gaussian random variables on a given probability space is thus a 
normed space. By Lemma 5.5, every sub-gaussian random variable $X$ satisfies: 

$$\mathbb{P}\{|X| > t\} \le \exp(1 - ct^2/\|X\|_{\psi_2}^2) \quad \text{for all } t \ge 0; \tag{5.10}$$

$$(\mathbb{E}|X|^p)^{1/p} \le \|X\|_{\psi_2}\sqrt{p} \quad \text{for all } p \ge 1; \tag{5.11}$$

$$\mathbb{E}\exp(cX^2/\|X\|_{\psi_2}^2) \le e;$$

$$\text{if } \mathbb{E}X = 0 \text{ then } \mathbb{E}\exp(tX) \le \exp(Ct^2\|X\|_{\psi_2}^2) \quad \text{for all } t \in \mathbb{R}, \tag{5.12}$$

where $C, c > 0$ are absolute constants. Moreover, up to absolute constant factors, $\|X\|_{\psi_2}$ 
is the smallest possible number in each of these inequalities. 

Example 5.8. Classical examples of sub-gaussian random variables are Gaussian, Bernoulli 
and all bounded random variables. 

1. (Gaussian): A standard normal random variable $X$ is sub-gaussian with $\|X\|_{\psi_2} \le 
C$ where $C$ is an absolute constant. This follows from (5.6). More generally, if $X$ is 
a centered normal random variable with variance $\sigma^2$, then $X$ is sub-gaussian with 
$\|X\|_{\psi_2} \le C\sigma$. 

2. (Bernoulli): Consider a random variable $X$ with distribution $\mathbb{P}\{X = -1\} = 
\mathbb{P}\{X = 1\} = 1/2$. We call $X$ a symmetric Bernoulli random variable. Since 
$|X| = 1$, it follows that $X$ is a sub-gaussian random variable with $\|X\|_{\psi_2} = 1$. 

3. (Bounded): More generally, consider any bounded random variable $X$, thus $|X| \le 
M$ almost surely for some $M$. Then $X$ is a sub-gaussian random variable with 
$\|X\|_{\psi_2} \le M$. We can write this more compactly as $\|X\|_{\psi_2} \le \|X\|_\infty$. 

^11 The sub-gaussian norm is also called the $\psi_2$ norm in the literature. 
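The sub-gaussian norms of these examples can be approximated directly from the moment definition. The sketch below is illustrative only: it takes the supremum over integer exponents up to a cutoff `p_max` (an assumption made for the example; the definition ranges over all real $p \ge 1$), and the uniform distribution on $[-1, 1]$ is added as a sample bounded variable.

```python
import math

def psi2_norm(abs_moment, p_max=60):
    """Approximate sup_{p>=1} p^(-1/2) (E|X|^p)^(1/p) over integers p <= p_max."""
    return max(abs_moment(p) ** (1 / p) / math.sqrt(p) for p in range(1, p_max + 1))

# Absolute moments E|X|^p of three example distributions.
def bernoulli(p):
    return 1.0                                  # |X| = 1 almost surely

def gaussian(p):
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

def uniform(p):
    return 1.0 / (p + 1)                        # uniform on [-1, 1]

for name, m in [("Bernoulli", bernoulli), ("Gaussian", gaussian), ("uniform", uniform)]:
    print(name, psi2_norm(m))
```

The Bernoulli value should be exactly 1, and the uniform value should not exceed its bound $\|X\|_\infty = 1$.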

A remarkable property of the normal distribution is rotation invariance. Given a 
finite number of independent centered normal random variables $X_i$, their sum $\sum_i X_i$ 
is also a centered normal random variable, obviously with $\mathrm{Var}(\sum_i X_i) = \sum_i \mathrm{Var}(X_i)$. 
Rotation invariance passes onto sub-gaussian random variables, although approximately: 

Lemma 5.9 (Rotation invariance). Consider a finite number of independent centered 
sub-gaussian random variables $X_i$. Then $\sum_i X_i$ is also a centered sub-gaussian random 
variable. Moreover, 

$$\Big\|\sum_i X_i\Big\|_{\psi_2}^2 \le C \sum_i \|X_i\|_{\psi_2}^2$$

where $C$ is an absolute constant. 

Proof. The argument is based on estimating the moment generating function. Using 
independence and (5.12) we have for every $t \in \mathbb{R}$: 

$$\mathbb{E}\exp\Big(t\sum_i X_i\Big) = \mathbb{E}\prod_i \exp(tX_i) = \prod_i \mathbb{E}\exp(tX_i) \le \prod_i \exp(Ct^2\|X_i\|_{\psi_2}^2) = \exp(t^2K^2) \quad \text{where } K^2 = C\sum_i \|X_i\|_{\psi_2}^2.$$

Using the equivalence of properties 2 and 4 in Lemma 5.5 we conclude that $\|\sum_i X_i\|_{\psi_2} \le 
C_1K$ where $C_1$ is an absolute constant. The proof is complete. □ 

The rotation invariance immediately yields a large deviation inequality for sums of 
independent sub-gaussian random variables: 

Proposition 5.10 (Hoeffding-type inequality). Let $X_1, \ldots, X_N$ be independent centered 
sub-gaussian random variables, and let $K = \max_i \|X_i\|_{\psi_2}$. Then for every $a = 
(a_1, \ldots, a_N) \in \mathbb{R}^N$ and every $t \ge 0$, we have 

$$\mathbb{P}\Big\{\Big|\sum_{i=1}^N a_iX_i\Big| \ge t\Big\} \le e \cdot \exp\Big(-\frac{ct^2}{K^2\|a\|_2^2}\Big)$$

where $c > 0$ is an absolute constant. 

Proof. The rotation invariance (Lemma 5.9) implies the bound $\|\sum_i a_iX_i\|_{\psi_2}^2 \le C\sum_i a_i^2\|X_i\|_{\psi_2}^2 \le 
CK^2\|a\|_2^2$. Property (5.10) yields the required tail decay. □ 

Remark 5.11. One can interpret these results (Lemma 5.9 and Proposition 5.10) as 
one-sided non-asymptotic manifestations of the central limit theorem. For example, 
consider the normalized sum of independent symmetric Bernoulli random variables $S_N = 
\frac{1}{\sqrt{N}}\sum_{i=1}^N \varepsilon_i$. Proposition 5.10 yields the tail bounds $\mathbb{P}\{|S_N| > t\} \le e \cdot e^{-ct^2}$ for any 
number of terms $N$. Up to the absolute constants $e$ and $c$, these tails coincide with those 
of the standard normal random variable (5.5). 
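For the normalized Bernoulli sum the tail can be computed exactly from binomial coefficients, which gives a concrete check. The sketch below compares it with the classical Hoeffding bound $2\exp(-t^2/2)$; those explicit constants come from the classical inequality for symmetric signs and are an assumption of this example, not the unspecified constants $e$ and $c$ of the proposition. The function name and the choice $N = 200$ are illustrative.

```python
import math

def rademacher_sum_tail(N, t):
    """Exact P{|S_N| >= t}, where S_N = N^(-1/2) * (sum of N symmetric signs)."""
    total = 0
    for k in range(N + 1):                 # k = number of +1 signs
        if abs(2 * k - N) >= t * math.sqrt(N):
            total += math.comb(N, k)
    return total / 2 ** N

N = 200
for t in [0.5, 1.0, 2.0, 3.0]:
    exact = rademacher_sum_tail(N, t)
    bound = 2 * math.exp(-t**2 / 2)        # classical Hoeffding bound
    print(f"t = {t}: exact tail = {exact:.5f}, bound = {bound:.5f}")
```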

Using moment growth (5.11) instead of the tail decay (5.10), we immediately obtain 
from Lemma 5.9 a general form of the well known Khintchine inequality: 

Corollary 5.12 (Khintchine inequality). Let $X_i$ be a finite number of independent 
sub-gaussian random variables with zero mean, unit variance, and $\|X_i\|_{\psi_2} \le K$. Then, for 
every sequence of coefficients $a_i$ and every exponent $p \ge 2$ we have 

$$\Big(\sum_i a_i^2\Big)^{1/2} \le \Big(\mathbb{E}\Big|\sum_i a_iX_i\Big|^p\Big)^{1/p} \le CK\sqrt{p}\,\Big(\sum_i a_i^2\Big)^{1/2}$$

where $C$ is an absolute constant. 

Proof. The lower bound follows by independence and Hölder's inequality: indeed, 
$(\mathbb{E}|\sum_i a_iX_i|^p)^{1/p} \ge (\mathbb{E}|\sum_i a_iX_i|^2)^{1/2} = (\sum_i a_i^2)^{1/2}$. For the upper bound, we argue as in 
Proposition 5.10 but use property (5.11). □ 
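For symmetric Bernoulli signs the moments in Khintchine's inequality can be computed exactly by enumerating sign patterns. The following sketch does this for a short, arbitrarily chosen coefficient vector; the helper name `rademacher_lp_norm` is made up for the illustration. The comparison value $\sqrt{p}\,\|a\|_2$ uses the known fact that for symmetric signs the upper bound holds with constant at most $\sqrt{p}$, which is specific to this case and not the general constant $CK$.

```python
import itertools
import math

def rademacher_lp_norm(a, p):
    """Exact (E|sum_i a_i eps_i|^p)^(1/p) by enumerating all sign patterns."""
    n = len(a)
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        total += abs(sum(s * x for s, x in zip(signs, a))) ** p
    return (total / 2 ** n) ** (1 / p)

a = [3.0, -1.0, 2.0, 0.5, 1.5, -2.5, 1.0, 0.25]   # arbitrary coefficients
l2 = math.sqrt(sum(x * x for x in a))

for p in [2, 4, 6]:
    lp = rademacher_lp_norm(a, p)
    print(f"p = {p}: ||a||_2 = {l2:.4f}, L_p norm = {lp:.4f}, ratio = {lp / l2:.4f}")
```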

5.2.4 Sub-exponential random variables 

Although the class of sub-gaussian random variables is natural and quite wide, it leaves 
out some useful random variables which have tails heavier than gaussian. One such 
example is a standard exponential random variable - a non-negative random variable 
with exponential tail decay 

$$\mathbb{P}\{X \ge t\} = e^{-t}, \quad t \ge 0. \tag{5.13}$$

To cover such examples, we consider a class of sub-exponential random variables, those 
with at least an exponential tail decay. With appropriate modifications, the basic 
properties of sub-gaussian random variables hold for sub-exponentials. In particular, a version 
of Lemma 5.5 holds with a similar proof for sub-exponential properties, except for 
property 4 of the moment generating function. Thus for a random variable $X$ the following 
properties are equivalent with parameters $K_i > 0$ differing from each other by at most 
an absolute constant factor: 

$$\mathbb{P}\{|X| > t\} \le \exp(1 - t/K_1) \quad \text{for all } t \ge 0; \tag{5.14}$$

$$(\mathbb{E}|X|^p)^{1/p} \le K_2\,p \quad \text{for all } p \ge 1; \tag{5.15}$$

$$\mathbb{E}\exp(X/K_3) \le e. \tag{5.16}$$

Definition 5.13 (Sub-exponential random variables). A random variable $X$ that 
satisfies one of the equivalent properties (5.14) - (5.16) is called a sub-exponential random 
variable. The sub-exponential norm of $X$, denoted $\|X\|_{\psi_1}$, is defined to be the smallest 
parameter $K_2$. In other words, 

$$\|X\|_{\psi_1} = \sup_{p \ge 1} p^{-1} (\mathbb{E}|X|^p)^{1/p}.$$

Lemma 5.14 (Sub-exponential is sub-gaussian squared). A random variable $X$ is 
sub-gaussian if and only if $X^2$ is sub-exponential. Moreover, 

$$\|X\|_{\psi_2}^2 \le \|X^2\|_{\psi_1} \le 2\|X\|_{\psi_2}^2.$$

Proof. This follows easily from the definition. □ 
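The omitted computation can be sketched as follows. This is one possible argument filled in here, not necessarily the one the author had in mind. Substituting $q = 2p$ in the definition of the sub-exponential norm gives the upper bound; the lower bound needs a separate remark for exponents $1 \le q < 2$.

```latex
% Upper bound: substitute q = 2p, so that p^{-1} = 2/q.
\|X^2\|_{\psi_1}
  = \sup_{p \ge 1} p^{-1} \big(\mathbb{E}|X|^{2p}\big)^{1/p}
  = 2 \sup_{q \ge 2} \Big( q^{-1/2} \big(\mathbb{E}|X|^{q}\big)^{1/q} \Big)^{2}
  \le 2\,\|X\|_{\psi_2}^2 .

% Lower bound, case q >= 2: the same identity gives
\Big( q^{-1/2} \big(\mathbb{E}|X|^{q}\big)^{1/q} \Big)^{2}
  \le \tfrac{1}{2}\,\|X^2\|_{\psi_1} .

% Lower bound, case 1 <= q < 2: L_q norms are non-decreasing in q, and
% E X^2 is the p = 1 term in the definition of \|X^2\|_{\psi_1}, so
q^{-1/2} \big(\mathbb{E}|X|^{q}\big)^{1/q}
  \le \big(\mathbb{E}X^{2}\big)^{1/2}
  \le \|X^2\|_{\psi_1}^{1/2} .

% Taking the supremum over all q >= 1:
\|X\|_{\psi_2}^2 \le \|X^2\|_{\psi_1} .
```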

The moment generating function of a sub-exponential random variable has a similar 
upper bound as in the sub-gaussian case (property 4 in Lemma 5.5). The only real 
difference is that the bound only holds in a neighborhood of zero rather than on the 
whole real line. This is inevitable, as the moment generating function of an exponential 
random variable (5.13) does not exist for $t \ge 1$. 

Lemma 5.15 (Mgf of sub-exponential random variables). Let $X$ be a centered 
sub-exponential random variable. Then, for $t$ such that $|t| \le c/\|X\|_{\psi_1}$, one has 

$$\mathbb{E}\exp(tX) \le \exp(Ct^2\|X\|_{\psi_1}^2)$$

where $C, c > 0$ are absolute constants. 

Proof. The argument is similar to the sub-gaussian case. We can assume that $\|X\|_{\psi_1} = 1$ 
by replacing $X$ with $X/\|X\|_{\psi_1}$ and $t$ with $t\|X\|_{\psi_1}$. Repeating the proof of the implication 
2 ⇒ 4 of Lemma 5.5 and using $\mathbb{E}|X|^p \le p^p$ this time, we obtain that $\mathbb{E}\exp(tX) \le 1 + 
\sum_{p=2}^\infty (e|t|)^p$. If $|t| \le 1/2e$ then the right hand side is bounded by $1 + 2e^2t^2 \le \exp(2e^2t^2)$. 
This completes the proof. □ 
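The centered standard exponential variable gives a concrete instance where everything is explicit: its moment generating function is $e^{-t}/(1-t)$ for $t < 1$ and is infinite otherwise. The sketch below checks a quadratic-exponent bound on a window around zero. The constants used ($C = 1$ on the window $|t| \le 1/2$) are chosen for this one distribution and are assumptions of the example, not the absolute constants of the lemma.

```python
import math

def centered_exponential_mgf(t):
    """E exp(t(X - 1)) for a standard exponential X; finite only for t < 1."""
    if t >= 1:
        return math.inf
    return math.exp(-t) / (1 - t)

# Illustrative constants for this one distribution: C = 1 on |t| <= 1/2.
ts = [i / 100 for i in range(-50, 51)]
worst = max(centered_exponential_mgf(t) / math.exp(t**2) for t in ts)

print(f"max of mgf / exp(t^2) over |t| <= 1/2: {worst:.4f}")
print(centered_exponential_mgf(1.0))       # the mgf is infinite at t = 1
```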

Sub-exponential random variables satisfy a large deviation inequality similar to the 
one for sub-gaussians (Proposition 5.10). The only significant difference is that two tails 
have to appear here - a gaussian tail responsible for the central limit theorem, and an 
exponential tail coming from the tails of each term. 

Proposition 5.16 (Bernstein-type inequality). Let $X_1, \ldots, X_N$ be independent centered 
sub-exponential random variables, and $K = \max_i \|X_i\|_{\psi_1}$. Then for every $a = 
(a_1, \ldots, a_N) \in \mathbb{R}^N$ and every $t \ge 0$, we have 

$$\mathbb{P}\Big\{\Big|\sum_{i=1}^N a_iX_i\Big| \ge t\Big\} \le 2\exp\Big[-c\min\Big(\frac{t^2}{K^2\|a\|_2^2}, \frac{t}{K\|a\|_\infty}\Big)\Big]$$

where $c > 0$ is an absolute constant. 

Proof. Without loss of generality, we assume that $K = 1$ by replacing $X_i$ with $X_i/K$ 
and $t$ with $t/K$. We use the exponential Markov inequality for the sum $S = \sum_i a_iX_i$ 
and with a parameter $\lambda > 0$: 

$$\mathbb{P}\{S \ge t\} = \mathbb{P}\{e^{\lambda S} \ge e^{\lambda t}\} \le e^{-\lambda t}\,\mathbb{E}e^{\lambda S} = e^{-\lambda t}\prod_i \mathbb{E}\exp(\lambda a_iX_i).$$

If $|\lambda| \le c/\|a\|_\infty$ then $|\lambda a_i| \le c$ for all $i$, so Lemma 5.15 yields 

$$\mathbb{P}\{S \ge t\} \le e^{-\lambda t}\prod_i \exp(C\lambda^2a_i^2) = \exp(-\lambda t + C\lambda^2\|a\|_2^2).$$

Choosing $\lambda = \min(t/2C\|a\|_2^2,\ c/\|a\|_\infty)$, we obtain that 

$$\mathbb{P}\{S \ge t\} \le \exp\Big[-\min\Big(\frac{t^2}{4C\|a\|_2^2}, \frac{ct}{2\|a\|_\infty}\Big)\Big].$$

Repeating this argument for $-X_i$ instead of $X_i$, we obtain the same bound for $\mathbb{P}\{-S \ge 
t\}$. A combination of these two bounds completes the proof. □ 

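Corollary 5.17. Let $X_1, \ldots, X_N$ be independent centered sub-exponential random
variables, and let $K = \max_i \|X_i\|_{\psi_1}$. Then, for every $\varepsilon \ge 0$, we have

$$\mathbb{P}\Big\{ \Big| \sum_{i=1}^N X_i \Big| \ge \varepsilon N \Big\} \le 2 \exp\Big[ -c \min\Big( \frac{\varepsilon^2}{K^2}, \frac{\varepsilon}{K} \Big) N \Big]$$

where $c > 0$ is an absolute constant.

Proof. This follows from Proposition 5.16 for $a_i = 1$ and $t = \varepsilon N$. □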
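The mixture of two tails in Corollary 5.17 can be seen explicitly for sums of centered standard exponential variables $\xi_i - 1$. In this special case the upper tail of the sum is available exactly (the sum of the $\xi_i$ has a Gamma distribution, whose tail is a Poisson probability), and the optimized exponential Markov bound from the proof above has exponent $N(\varepsilon - \log(1+\varepsilon))$, which behaves like $\varepsilon^2/2$ for small $\varepsilon$ and like $\varepsilon$ for large $\varepsilon$. The sketch below compares the two; the function names and the values of $N$ and $\varepsilon$ are illustrative assumptions.

```python
import math

def gamma_upper_tail(N, x):
    """P{xi_1 + ... + xi_N >= x} for independent Exp(1) variables.

    The sum is Gamma(N, 1), and its upper tail equals P{Poisson(x) <= N - 1}.
    Each term is evaluated in log-space to avoid overflow.
    """
    return sum(math.exp(k * math.log(x) - math.lgamma(k + 1) - x) for k in range(N))

def chernoff_bound(N, eps):
    """Optimized exponential Markov bound for P{sum_i (xi_i - 1) >= eps * N}."""
    return math.exp(-N * (eps - math.log1p(eps)))

N = 50
for eps in (0.1, 0.5, 1.0, 2.0):
    exact = gamma_upper_tail(N, N * (1.0 + eps))
    print(eps, exact, chernoff_bound(N, eps))
```

Only the upper tail is examined here; the lower tail is handled in the same way by applying the argument to $-X_i$.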

Remark 5.18 (Centering). The definitions of sub-gaussian and sub-exponential random 
variables X do not require them to be centered. In any case, one can always center X 
using the simple fact that if X is sub-gaussian (or sub-exponential), then so is X — EX. 
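Moreover,

$$\|X - \mathbb{E}X\|_{\psi_2} \le 2\|X\|_{\psi_2}, \qquad \|X - \mathbb{E}X\|_{\psi_1} \le 2\|X\|_{\psi_1}.$$

This follows by triangle inequality $\|X - \mathbb{E}X\|_{\psi_2} \le \|X\|_{\psi_2} + \|\mathbb{E}X\|_{\psi_2}$
along with $\|\mathbb{E}X\|_{\psi_2} = |\mathbb{E}X| \le \mathbb{E}|X| \le \|X\|_{\psi_2}$, and similarly
for the sub-exponential norm.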



5.2.5 Isotropic random vectors 

Now we carry our work over to higher dimensions. We will thus be working with random
vectors $X$ in $\mathbb{R}^n$, or equivalently probability distributions in $\mathbb{R}^n$.

While the concept of the mean $\mu = \mathbb{E}Z$ of a random variable $Z$ remains the same
in higher dimensions, the second moment $\mathbb{E}Z^2$ is replaced by the $n \times n$ second
moment matrix of a random vector $X$, defined as

$$\Sigma = \Sigma(X) = \mathbb{E} X \otimes X = \mathbb{E} XX^T$$

where $\otimes$ denotes the outer product of vectors in $\mathbb{R}^n$. Similarly, the concept of
variance $\mathrm{Var}(Z) = \mathbb{E}(Z - \mu)^2 = \mathbb{E}Z^2 - \mu^2$ of a random variable is
replaced in higher dimensions with the covariance matrix of a random vector $X$, defined as

$$\mathrm{Cov}(X) = \mathbb{E}(X - \mu) \otimes (X - \mu) = \mathbb{E} X \otimes X - \mu \otimes \mu$$

where $\mu = \mathbb{E}X$. By translation, many questions can be reduced to the case of centered
random vectors, for which $\mu = 0$ and $\mathrm{Cov}(X) = \Sigma(X)$. We will also need a
higher-dimensional version of unit variance:

Definition 5.19 (Isotropic random vectors). A random vector $X$ in $\mathbb{R}^n$ is called
isotropic if $\Sigma(X) = I$. Equivalently, $X$ is isotropic if

$$\mathbb{E}\langle X, x \rangle^2 = \|x\|_2^2 \quad \text{for all } x \in \mathbb{R}^n. \qquad (5.17)$$

Suppose $\Sigma(X)$ is an invertible matrix, which means that the distribution of $X$ is not
essentially supported on any proper subspace of $\mathbb{R}^n$. Then $\Sigma(X)^{-1/2}X$ is an
isotropic random vector in $\mathbb{R}^n$. Thus every non-degenerate random vector can be made
isotropic by an appropriate linear transformation. This allows us to mostly focus on studying
isotropic random vectors in the future.

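Lemma 5.20. Let $X$, $Y$ be independent isotropic random vectors in $\mathbb{R}^n$. Then
$\mathbb{E}\|X\|_2^2 = n$ and $\mathbb{E}\langle X, Y \rangle^2 = n$.

Proof. The first part follows from
$\mathbb{E}\|X\|_2^2 = \mathbb{E}\,\mathrm{tr}(X \otimes X) = \mathrm{tr}(\mathbb{E} X \otimes X) = \mathrm{tr}(I) = n$.
The second part follows by conditioning on $Y$, using isotropy of $X$ and using the first
part for $Y$: this way we obtain $\mathbb{E}\langle X, Y \rangle^2 = \mathbb{E}\|Y\|_2^2 = n$. □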
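Lemma 5.20 can be verified exactly for a small discrete isotropic distribution. The sketch below enumerates the symmetric Bernoulli vector in dimension $n = 3$ (uniform on the $2^n$ sign vectors) and computes both expectations with exact rational arithmetic. The dimension and the helper name `inner` are choices made for this illustration.

```python
from itertools import product
from fractions import Fraction

n = 3
cube = list(product((-1, 1), repeat=n))   # support of the symmetric Bernoulli vector

def inner(x, y):
    """Euclidean inner product of two sign vectors."""
    return sum(a * b for a, b in zip(x, y))

# E ||X||_2^2: average over the 2^n equally likely sign vectors.
mean_sq_norm = Fraction(sum(inner(x, x) for x in cube), len(cube))

# E <X, Y>^2 for independent X and Y: average over all 4^n ordered pairs.
mean_sq_inner = Fraction(sum(inner(x, y) ** 2 for x in cube for y in cube),
                         len(cube) ** 2)

print(mean_sq_norm, mean_sq_inner)   # prints: 3 3
```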

Example 5.21. 1. (Gaussian): The (standard) Gaussian random vector $X$ in $\mathbb{R}^n$
chosen according to the standard normal distribution $N(0, I)$ is isotropic. The
coordinates of $X$ are independent standard normal random variables.

2. (Bernoulli): A similar example of a discrete isotropic distribution is given by a
Bernoulli random vector $X$ in $\mathbb{R}^n$ whose coordinates are independent symmetric
Bernoulli random variables.

3. (Product distributions): More generally, consider a random vector $X$ in $\mathbb{R}^n$
whose coordinates are independent random variables with zero mean and unit
variance. Then clearly $X$ is an isotropic vector in $\mathbb{R}^n$.

4. (Coordinate): Consider a coordinate random vector $X$, which is uniformly
distributed in the set $\{\sqrt{n}\, e_i\}_{i=1}^n$ where $\{e_i\}_{i=1}^n$ is the canonical
basis of $\mathbb{R}^n$. Clearly $X$ is an isotropic random vector in $\mathbb{R}^n$.

5. (Frame): This is a more general version of the coordinate random vector. A frame
is a set of vectors $\{u_i\}_{i=1}^M$ in $\mathbb{R}^n$ which obeys an approximate Parseval's
identity, i.e. there exist numbers $A, B > 0$ called frame bounds such that

$$A\|x\|_2^2 \le \sum_{i=1}^M \langle u_i, x \rangle^2 \le B\|x\|_2^2 \quad \text{for all } x \in \mathbb{R}^n.$$

If $A = B$ the set is called a tight frame. Thus, tight frames are generalizations of
orthogonal bases without linear independence. Given a tight frame $\{u_i\}_{i=1}^M$ with
bounds $A = B = M$, the random vector $X$ uniformly distributed in the set $\{u_i\}_{i=1}^M$
is clearly isotropic in $\mathbb{R}^n$.



Footnote: This transformation (usually preceded by centering) is a higher-dimensional version
of standardizing of random variables, which enforces zero mean and unit variance.

Footnote: The examples of Gaussian and coordinate random vectors are somewhat opposite - one
is very continuous and the other is very discrete. They may be used as test cases in our
study of random matrices.

Footnote: There is clearly a reverse implication, too, which shows that the class of tight
frames can be identified with the class of discrete isotropic random vectors.

6. (Spherical): Consider a random vector $X$ uniformly distributed on the Euclidean
sphere in $\mathbb{R}^n$ with center at the origin and radius $\sqrt{n}$. Then $X$ is isotropic.
Indeed, by rotation invariance $\mathbb{E}\langle X, x \rangle^2$ is proportional to $\|x\|_2^2$;
the correct normalization $\sqrt{n}$ is derived from Lemma 5.20.

7. (Uniform on a convex set): In convex geometry, a convex set $K$ in $\mathbb{R}^n$ is
called isotropic if a random vector $X$ chosen uniformly from $K$ according to the
volume is isotropic. As we noted, every full dimensional convex set can be made
into an isotropic one by an affine transformation. Isotropic convex sets look "well
conditioned", which is advantageous in geometric algorithms (e.g. volume computations).
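For the discrete examples above (coordinate vectors and tight frames), isotropy can be checked by computing the second moment matrix directly as a finite average. The following sketch does this for the coordinate vector in $\mathbb{R}^4$ and for one particular tight frame in $\mathbb{R}^2$: three directions $120$ degrees apart, each scaled to length $\sqrt{2}$ so that the frame bounds equal $M = 3$. The helper names `second_moment` and `is_identity`, and the choice of this frame, are assumptions made for illustration.

```python
import math

def second_moment(points):
    """Second moment matrix E X (x) X for X uniform on a finite list of points."""
    n = len(points[0])
    return [[sum(p[i] * p[j] for p in points) / len(points) for j in range(n)]
            for i in range(n)]

def is_identity(mat, tol=1e-12):
    """Check entrywise that mat equals the identity up to rounding error."""
    return all(abs(mat[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(len(mat)) for j in range(len(mat)))

# Coordinate random vector in R^4: uniform on {sqrt(n) e_i}.
n = 4
coordinate = [[math.sqrt(n) if j == i else 0.0 for j in range(n)] for i in range(n)]

# A tight frame of M = 3 vectors in R^2 with frame bounds A = B = M.
frame = [[math.sqrt(2) * math.cos(2 * math.pi * k / 3),
          math.sqrt(2) * math.sin(2 * math.pi * k / 3)] for k in range(3)]

print(is_identity(second_moment(coordinate)), is_identity(second_moment(frame)))
```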

We generalize the concepts of sub-gaussian random variables to higher dimensions 
using one-dimensional marginals. 

Definition 5.22 (Sub-gaussian random vectors). We say that a random vector $X$ in
$\mathbb{R}^n$ is sub-gaussian if the one-dimensional marginals $\langle X, x \rangle$ are
sub-gaussian random variables for all $x \in \mathbb{R}^n$. The sub-gaussian norm of $X$ is
defined as

$$\|X\|_{\psi_2} = \sup_{x \in S^{n-1}} \|\langle X, x \rangle\|_{\psi_2}.$$

Remark 5.23 (Properties of high-dimensional distributions). The definitions of isotropic
and sub-gaussian distributions suggest that more generally, natural properties of
high-dimensional distributions may be defined via one-dimensional marginals. This is a
natural way to generalize properties of random variables to random vectors. For example,
we shall call a random vector sub-exponential if all of its one-dimensional marginals are
sub-exponential random variables, etc.

One simple way to create sub-gaussian distributions in $\mathbb{R}^n$ is by taking a product of
$n$ sub-gaussian distributions on the line:

Lemma 5.24 (Product of sub-gaussian distributions). Let $X_1, \ldots, X_n$ be independent
centered sub-gaussian random variables. Then $X = (X_1, \ldots, X_n)$ is a centered
sub-gaussian random vector in $\mathbb{R}^n$, and

$$\|X\|_{\psi_2} \le C \max_{i \le n} \|X_i\|_{\psi_2}$$

where $C$ is an absolute constant.

Proof. This is a direct consequence of the rotation invariance principle, Lemma 5.9.
Indeed, for every $x = (x_1, \ldots, x_n) \in S^{n-1}$ we have

$$\|\langle X, x \rangle\|_{\psi_2}^2 = \Big\| \sum_{i=1}^n x_i X_i \Big\|_{\psi_2}^2 \le C \sum_{i=1}^n x_i^2 \|X_i\|_{\psi_2}^2 \le C \max_{i \le n} \|X_i\|_{\psi_2}^2$$

where we used that $\sum_{i=1}^n x_i^2 = 1$. This completes the proof. □

Example 5.25. Let us analyze the basic examples of random vectors introduced earlier
in Example 5.21.

1. (Gaussian, Bernoulli): Gaussian and Bernoulli random vectors are sub-gaussian;
their sub-gaussian norms are bounded by an absolute constant. These are particular
cases of Lemma 5.24.

2. (Spherical): A spherical random vector is also sub-gaussian; its sub-gaussian
norm is bounded by an absolute constant. Unfortunately, this does not follow
from Lemma 5.24 because the coordinates of the spherical vector are not independent.
Instead, by rotation invariance, the claim clearly follows from the following
geometric fact. For every $\varepsilon \ge 0$, the spherical cap
$\{x \in S^{n-1} : x_1 > \varepsilon\}$ makes up at most $\exp(-\varepsilon^2 n/2)$ proportion
of the total area on the sphere. This can be proved directly by integration, and also by
elementary geometric considerations [9, Lemma 2.2].

3. (Coordinate): Although the coordinate random vector $X$ is formally sub-gaussian
as its support is finite, its sub-gaussian norm is too big: $\|X\|_{\psi_2} \gg 1$. So we
would not think of $X$ as a sub-gaussian random vector.

4. (Uniform on a convex set): For many isotropic convex sets $K$ (called $\psi_2$ bodies),
a random vector $X$ uniformly distributed in $K$ is sub-gaussian with $\|X\|_{\psi_2} = O(1)$.
For example, the cube $[-1, 1]^n$ is a $\psi_2$ body by Lemma 5.24, while the appropriately
normalized cross-polytope $\{x \in \mathbb{R}^n : \|x\|_1 \le M\}$ is not. Nevertheless, Borell's
lemma (which is a consequence of Brunn-Minkowski inequality) implies a weaker
property, that $X$ is always sub-exponential, and
$\|X\|_{\psi_1} = \sup_{x \in S^{n-1}} \|\langle X, x \rangle\|_{\psi_1}$ is
bounded by absolute constant. See [33, Section 2.2.b3] for a proof and discussion
of these ideas.
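The geometric fact about spherical caps used for the spherical example can be probed by simulation. The sketch below samples uniform points on $S^{n-1}$ by normalizing standard Gaussian vectors and compares the empirical proportion of points in the cap $\{x_1 > \varepsilon\}$ with the bound $\exp(-\varepsilon^2 n/2)$. The function name `cap_fraction`, the seed, and the values $n = 50$, $\varepsilon = 0.3$ are illustrative assumptions; a Monte Carlo estimate of course does not prove the bound.

```python
import math
import random

def cap_fraction(n, eps, samples, rng):
    """Monte Carlo estimate of the normalized area of {x in S^{n-1} : x_1 > eps}."""
    hits = 0
    for _ in range(samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]   # standard Gaussian vector
        norm = math.sqrt(sum(v * v for v in g))
        if g[0] / norm > eps:                          # g / |g| is uniform on the sphere
            hits += 1
    return hits / samples

rng = random.Random(0)
n, eps = 50, 0.3
estimate = cap_fraction(n, eps, 2000, rng)
bound = math.exp(-eps ** 2 * n / 2)
print(estimate, bound)
```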

5.2.6 Sums of independent random matrices 

In this section, we mention without proof some results of classical probability theory 
in which scalars can be replaced by matrices. Such results are useful in particular for 
problems on random matrices, since we can view a random matrix as a generalization of 
a random variable. One such remarkable generalization is valid for Khintchine inequality,
Corollary 5.12. The scalars $a_i$ can be replaced by matrices, and the absolute value by
the Schatten norm. Recall that for $1 \le p \le \infty$, the $p$-Schatten norm of an
$n \times n$ matrix $A$ is defined as the $\ell_p$ norm of the sequence of its singular values:

$$\|A\|_{C_p^n} = \|(s_i(A))_{i=1}^n\|_p = \Big( \sum_{i=1}^n s_i(A)^p \Big)^{1/p}.$$

For $p = \infty$, the Schatten norm equals the spectral norm $\|A\| = \max_{i \le n} s_i(A)$.
Using this one can quickly check that already for $p = \log n$ the Schatten and spectral
norms are equivalent: $\|A\| \le \|A\|_{C_p^n} \le n^{1/p}\|A\| = e\|A\|$.
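The equivalence of the Schatten and spectral norms at $p = \log n$ is easy to check numerically, since both norms depend only on the singular values. In the sketch below the list `svals` is a hypothetical set of singular values chosen for illustration, and `schatten_norm` is a helper defined here, not a library function.

```python
import math

def schatten_norm(singular_values, p):
    """l_p norm of the singular values; p = math.inf gives the spectral norm."""
    if p == math.inf:
        return max(singular_values)
    return sum(s ** p for s in singular_values) ** (1.0 / p)

# Hypothetical singular values of a 20 x 20 matrix.
svals = [5.0, 3.0, 3.0] + [1.0] * 17
n = len(svals)
p = math.log(n)

spectral = schatten_norm(svals, math.inf)
schatten = schatten_norm(svals, p)
print(spectral, schatten, math.e * spectral)
```

The middle number lies between the other two because the sum of $n$ terms $s_i^p$ is at least the largest term and at most $n$ times the largest term, and $n^{1/\log n} = e$.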



Footnote: This fact about spherical caps may seem counter-intuitive. For example, for
$\varepsilon = 0.1$ the cap looks similar to a hemisphere, but the proportion of its area goes
to zero very fast as dimension $n$ increases. This is a starting point of the study of the
concentration of measure phenomenon, see [43].

Theorem 5.26 (Non-commutative Khintchine inequality, see Section 9.8). Let
$A_1, \ldots, A_N$ be self-adjoint $n \times n$ matrices and $\varepsilon_1, \ldots, \varepsilon_N$ be
independent symmetric Bernoulli random variables. Then, for every $2 \le p < \infty$, we have

$$\Big\| \Big( \sum_{i=1}^N A_i^2 \Big)^{1/2} \Big\|_{C_p^n} \le \Big( \mathbb{E} \Big\| \sum_{i=1}^N \varepsilon_i A_i \Big\|_{C_p^n}^p \Big)^{1/p} \le C\sqrt{p}\, \Big\| \Big( \sum_{i=1}^N A_i^2 \Big)^{1/2} \Big\|_{C_p^n}$$

where $C$ is an absolute constant.



Remark 5.27. 1. The scalar case of this result, for $n = 1$, recovers the classical
Khintchine inequality, Corollary 5.12, for $X_i = \varepsilon_i$.

2. By the equivalence of Schatten and spectral norms for $p = \log n$, a version of
non-commutative Khintchine inequality holds for the spectral norm:

$$\mathbb{E} \Big\| \sum_{i=1}^N \varepsilon_i A_i \Big\| \le C_1 \sqrt{\log n}\, \Big\| \Big( \sum_{i=1}^N A_i^2 \Big)^{1/2} \Big\| \qquad (5.18)$$

where $C_1$ is an absolute constant. The logarithmic factor is unfortunately essential;
its role will be clear when we discuss applications of this result to random matrices
in the next sections.

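Corollary 5.28 (Rudelson's inequality [65]). Let $x_1, \ldots, x_N$ be vectors in
$\mathbb{R}^n$ and $\varepsilon_1, \ldots, \varepsilon_N$ be independent symmetric Bernoulli random
variables. Then

$$\mathbb{E} \Big\| \sum_{i=1}^N \varepsilon_i\, x_i \otimes x_i \Big\| \le C \sqrt{\log \min(N, n)} \cdot \max_{i \le N} \|x_i\|_2 \cdot \Big\| \sum_{i=1}^N x_i \otimes x_i \Big\|^{1/2}$$

where $C$ is an absolute constant.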



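Proof. One can assume that $n \le N$ by replacing $\mathbb{R}^n$ with the linear span of
$\{x_1, \ldots, x_N\}$ if necessary. The claim then follows from (5.18), since

$$\Big\| \Big( \sum_{i=1}^N (x_i \otimes x_i)^2 \Big)^{1/2} \Big\| = \Big\| \sum_{i=1}^N \|x_i\|_2^2\; x_i \otimes x_i \Big\|^{1/2} \le \max_{i \le N} \|x_i\|_2\, \Big\| \sum_{i=1}^N x_i \otimes x_i \Big\|^{1/2}.$$

□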
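The proof of Rudelson's inequality rests on the identity $(x \otimes x)^2 = \|x\|_2^2\, x \otimes x$, which holds because $(xx^T)(xx^T) = x(x^Tx)x^T$. The sketch below confirms it for one vector with integer entries, so that the comparison is exact; the vector and the helper names `outer` and `matmul` are chosen here for illustration.

```python
def outer(x, y):
    """Outer product x (x) y as a list of rows."""
    return [[a * b for b in y] for a in x]

def matmul(P, Q):
    """Plain matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

x = [1, -2, 3]                      # integer entries keep the check exact
sq_norm = sum(a * a for a in x)     # ||x||_2^2
P = outer(x, x)
scaled = [[sq_norm * v for v in row] for row in P]
print(matmul(P, P) == scaled)       # prints: True
```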



Ahlswede and Winter [4] pioneered a different approach to matrix-valued inequalities
in probability theory, which was based on trace inequalities like the Golden-Thompson
inequality. A development of this idea leads to remarkably sharp results. We quote one
such inequality from [77]:

Theorem 5.29 (Non-commutative Bernstein-type inequality [77]). Consider a finite
sequence $X_i$ of independent centered self-adjoint random $n \times n$ matrices. Assume we
have for some numbers $K$ and $\sigma$ that

$$\|X_i\| \le K \ \text{almost surely}, \qquad \Big\| \sum_i \mathbb{E} X_i^2 \Big\| \le \sigma^2.$$

Then, for every $t \ge 0$ we have

$$\mathbb{P}\Big\{ \Big\| \sum_i X_i \Big\| \ge t \Big\} \le 2n \cdot \exp\Big( \frac{-t^2/2}{\sigma^2 + Kt/3} \Big). \qquad (5.19)$$

Remark 5.30. This is a direct matrix generalization of a classical Bernstein's inequality
for bounded random variables. To compare it with our version of Bernstein's inequality
for sub-exponentials, Proposition 5.16, note that the probability bound in (5.19) is
equivalent to $2n \cdot \exp\big[ -c \min\big( \frac{t^2}{\sigma^2}, \frac{t}{K} \big) \big]$ where $c > 0$
is an absolute constant. In both results we see a mixture of gaussian and exponential tails.
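The equivalence claimed in Remark 5.30 can be made explicit: the exponent in (5.19) is sandwiched between $\frac14 \min(t^2/\sigma^2, t/K)$ and $\frac32 \min(t^2/\sigma^2, t/K)$, which follows by bounding $\sigma^2 + Kt/3$ above by twice its larger term and below by each term separately. The sketch below checks this sandwich on a grid; the values of $\sigma^2$ and $K$ and the function names are assumptions made for the example.

```python
def bernstein_exponent(t, sigma2, K):
    """Exponent in (5.19): (t^2 / 2) / (sigma^2 + K t / 3)."""
    return (t * t / 2.0) / (sigma2 + K * t / 3.0)

def mixed_tail(t, sigma2, K):
    """min(t^2 / sigma^2, t / K): gaussian regime for small t, exponential for large t."""
    return min(t * t / sigma2, t / K)

# Illustrative parameter values (assumptions, not taken from the text).
sigma2, K = 2.0, 0.5
ts = [0.01 * k for k in range(1, 5001)]
sandwiched = all(
    0.25 * mixed_tail(t, sigma2, K)
    <= bernstein_exponent(t, sigma2, K)
    <= 1.5 * mixed_tail(t, sigma2, K)
    for t in ts
)
print(sandwiched)
```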

5.3 Random matrices with independent entries 

We are ready to study the extreme singular values of random matrices. In this section, 
we consider the classical model of random matrices whose entries are independent and 
centered random variables. Later we will study the more difficult models where only the 
rows or the columns are independent. 

The reader may keep in mind some classical examples of N x n random matrices 
with independent entries. The most classical example is the Gaussian random matrix 
A whose entries are independent standard normal random variables. In this case, the 
$n \times n$ symmetric matrix $A^*A$ is called a Wishart matrix; it is a higher-dimensional version
of chi-square distributed random variables. 

The simplest example of discrete random matrices is the Bernoulli random matrix A 
whose entries are independent symmetric Bernoulli random variables. In other words, 
Bernoulli random matrices are distributed uniformly in the set of all N x n matrices 
with ±1 entries. 

5.3.1 Limit laws and Gaussian matrices 

Consider an $N \times n$ random matrix $A$ whose entries are independent centered identically
distributed random variables. By now, the limiting behavior of the extreme singular
values of $A$, as the dimensions $N, n \to \infty$, is well understood:

Theorem 5.31 (Bai-Yin's law, see [8]). Let $A = A_{N,n}$ be an $N \times n$ random matrix whose
entries are independent copies of a random variable with zero mean, unit variance, and
finite fourth moment. Suppose that the dimensions $N$ and $n$ grow to infinity while the
aspect ratio $n/N$ converges to a constant in $[0, 1]$. Then

$$s_{\min}(A) = \sqrt{N} - \sqrt{n} + o(\sqrt{n}), \qquad s_{\max}(A) = \sqrt{N} + \sqrt{n} + o(\sqrt{n}) \quad \text{almost surely}.$$

As we pointed out in the introduction, our program is to find non-asymptotic versions
of Bai-Yin's law. There is precisely one model of random matrices, namely Gaussian,
where an exact non-asymptotic result is known:

Theorem 5.32 (Gordon's theorem for Gaussian matrices). Let $A$ be an $N \times n$ matrix
whose entries are independent standard normal random variables. Then

$$\sqrt{N} - \sqrt{n} \le \mathbb{E} s_{\min}(A) \le \mathbb{E} s_{\max}(A) \le \sqrt{N} + \sqrt{n}.$$






The proof of the upper bound, which we borrowed from [21], is based on Slepian's
comparison inequality for Gaussian processes.

Lemma 5.33 (Slepian's inequality, see [45, Section 3.3]). Consider two Gaussian
processes $(X_t)_{t \in T}$ and $(Y_t)_{t \in T}$ whose increments satisfy the inequality
$\mathbb{E}|X_s - X_t|^2 \le \mathbb{E}|Y_s - Y_t|^2$ for all $s, t \in T$. Then
$\mathbb{E} \sup_{t \in T} X_t \le \mathbb{E} \sup_{t \in T} Y_t$.

Proof of Theorem 5.32. We recognize
$s_{\max}(A) = \max_{u \in S^{n-1},\, v \in S^{N-1}} \langle Au, v \rangle$ to be the
supremum of the Gaussian process $X_{u,v} = \langle Au, v \rangle$ indexed by the pairs of
vectors $(u, v) \in S^{n-1} \times S^{N-1}$. We shall compare this process to the following one
whose supremum is easier to estimate: $Y_{u,v} = \langle g, u \rangle + \langle h, v \rangle$ where
$g \in \mathbb{R}^n$ and $h \in \mathbb{R}^N$ are independent standard Gaussian random vectors. The
rotation invariance of Gaussian measures makes it easy to compare the increments of these
processes. For every $(u, v), (u', v') \in S^{n-1} \times S^{N-1}$, one can check that

$$\mathbb{E}|X_{u,v} - X_{u',v'}|^2 = \sum_{i=1}^n \sum_{j=1}^N |u_i v_j - u'_i v'_j|^2 \le \|u - u'\|_2^2 + \|v - v'\|_2^2 = \mathbb{E}|Y_{u,v} - Y_{u',v'}|^2.$$

Therefore Lemma 5.33 applies, and it yields the required bound

$$\mathbb{E} s_{\max}(A) = \mathbb{E} \max_{(u,v)} X_{u,v} \le \mathbb{E} \max_{(u,v)} Y_{u,v} = \mathbb{E}\|g\|_2 + \mathbb{E}\|h\|_2 \le \sqrt{N} + \sqrt{n}.$$

Similar ideas are used to estimate
$\mathbb{E} s_{\min}(A) = \mathbb{E} \min_{u \in S^{n-1}} \max_{v \in S^{N-1}} \langle Au, v \rangle$, see [21].
One uses in this case Gordon's generalization of Slepian's inequality for minimax of
Gaussian processes [35, 36, 37], see [45, Section 3.3]. □

While Theorem 5.32 is about the expectation of singular values, it also yields a large
deviation inequality for them. It can be deduced formally by using the concentration of
measure in the Gauss space.

Proposition 5.34 (Concentration in Gauss space, see [43]). Let $f$ be a real valued
Lipschitz function on $\mathbb{R}^n$ with Lipschitz constant $K$, i.e.
$|f(x) - f(y)| \le K\|x - y\|_2$ for all $x, y \in \mathbb{R}^n$ (such functions are also called
$K$-Lipschitz). Let $X$ be the standard normal random vector in $\mathbb{R}^n$. Then for every
$t \ge 0$ one has

$$\mathbb{P}\{f(X) - \mathbb{E}f(X) > t\} \le \exp(-t^2/2K^2).$$

Corollary 5.35 (Gaussian matrices, deviation; see [21]). Let $A$ be an $N \times n$ matrix
whose entries are independent standard normal random variables. Then for every $t \ge 0$,
with probability at least $1 - 2\exp(-t^2/2)$ one has

$$\sqrt{N} - \sqrt{n} - t \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + \sqrt{n} + t.$$

Proof. Note that $s_{\min}(A)$, $s_{\max}(A)$ are 1-Lipschitz functions of matrices $A$ considered
as vectors in $\mathbb{R}^{Nn}$. The conclusion now follows from the estimates on the expectation
(Theorem 5.32) and Gaussian concentration (Proposition 5.34). □
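Corollary 5.35 can be illustrated without any linear algebra library in the special case $n = 2$, where the singular values of an $N \times 2$ matrix are the square roots of the eigenvalues of the $2 \times 2$ Gram matrix $A^*A$ and have a closed form. The sketch below draws Gaussian matrices and counts how many satisfy the two-sided bound. The choices $N = 400$, $t = 5$, the seed, the number of trials and the helper name are assumptions made for this illustration.

```python
import math
import random

def singular_values_two_columns(rows):
    """Exact singular values of an N x 2 matrix via the 2 x 2 Gram matrix A*A."""
    a = sum(x * x for x, _ in rows)
    d = sum(y * y for _, y in rows)
    b = sum(x * y for x, y in rows)
    mid, rad = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    return math.sqrt(max(mid - rad, 0.0)), math.sqrt(mid + rad)

rng = random.Random(1)
N, n, t = 400, 2, 5.0
lower = math.sqrt(N) - math.sqrt(n) - t
upper = math.sqrt(N) + math.sqrt(n) + t

trials = 100
inside = 0
for _ in range(trials):
    rows = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(N)]
    s_min, s_max = singular_values_two_columns(rows)
    inside += lower <= s_min <= s_max <= upper
print(inside, "of", trials, "samples satisfy the two-sided bound")
```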



Footnote: Recall that a Gaussian process $(X_t)_{t \in T}$ is a collection of centered normal
random variables $X_t$ on the same probability space, indexed by points $t$ in an abstract
set $T$.

Later in these notes, we find it more convenient to work with the $n \times n$ positive-
definite symmetric matrix $A^*A$ rather than with the original $N \times n$ matrix $A$. Observe
that the normalized matrix $\bar{A} = \frac{1}{\sqrt{N}} A$ is an approximate isometry (which is
our goal) if and only if $\bar{A}^*\bar{A}$ is an approximate identity:

Lemma 5.36 (Approximate isometries). Consider a matrix $B$ that satisfies

$$\|B^*B - I\| \le \max(\delta, \delta^2) \qquad (5.20)$$

for some $\delta > 0$. Then

$$1 - \delta \le s_{\min}(B) \le s_{\max}(B) \le 1 + \delta. \qquad (5.21)$$

Conversely, if $B$ satisfies (5.21) for some $\delta > 0$ then
$\|B^*B - I\| \le 3\max(\delta, \delta^2)$.

Proof. Inequality (5.20) holds if and only if $\big| \|Bx\|_2^2 - 1 \big| \le \max(\delta, \delta^2)$
for all $x \in S^{n-1}$. Similarly, (5.21) holds if and only if $\big| \|Bx\|_2 - 1 \big| \le \delta$
for all $x \in S^{n-1}$. The conclusion then follows from the elementary inequality

$$\max(|z - 1|, |z - 1|^2) \le |z^2 - 1| \le 3\max(|z - 1|, |z - 1|^2) \quad \text{for all } z \ge 0. \quad □$$
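The elementary inequality at the end of this proof follows from $|z^2 - 1| = |z - 1|(z + 1)$ together with $1 \le z + 1$ and $|z - 1| \le z + 1$ for the lower bound, and $z + 1 \le 3$ for $z \le 2$, $z + 1 \le 3(z - 1)$ for $z \ge 2$ for the upper bound. As a sanity check, the sketch below tests it on a grid of values $z = \|Bx\|_2 \in [0, 10]$; the grid, the tolerance and the helper name `lower_side` are choices made here.

```python
def lower_side(z):
    """max(|z - 1|, |z - 1|^2), the quantity on both ends of the inequality."""
    return max(abs(z - 1.0), abs(z - 1.0) ** 2)

grid = [k / 100.0 for k in range(0, 1001)]   # z in [0, 10]
tol = 1e-12                                  # slack for floating-point rounding
ok = all(
    lower_side(z) <= abs(z * z - 1.0) + tol
    and abs(z * z - 1.0) <= 3.0 * lower_side(z) + tol
    for z in grid
)
print(ok)
```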

Lemma 5.36 reduces our task of proving inequalities (5.2) to showing an equivalent
(but often more convenient) bound

$$\Big\| \frac{1}{N} A^*A - I \Big\| \le \max(\delta, \delta^2) \quad \text{where } \delta = O(\sqrt{n/N}).$$

5.3.2 General random matrices with independent entries 

Now we pass to a more general model of random matrices whose entries are independent 
centered random variables with some general distribution (not necessarily normal). The 
largest singular value (the spectral norm) can be estimated by Latala's theorem for 
general random matrices with non-identically distributed entries: 

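Theorem 5.37 (Latala's theorem [42]). Let $A$ be a random matrix whose entries $a_{ij}$
are independent centered random variables with finite fourth moment. Then

$$\mathbb{E} s_{\max}(A) \le C \Big[ \max_i \Big( \sum_j \mathbb{E} a_{ij}^2 \Big)^{1/2} + \max_j \Big( \sum_i \mathbb{E} a_{ij}^2 \Big)^{1/2} + \Big( \sum_{i,j} \mathbb{E} a_{ij}^4 \Big)^{1/4} \Big].$$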

If the variance and the fourth moments of the entries are uniformly bounded, then
Latala's result yields $s_{\max}(A) = O(\sqrt{N} + \sqrt{n})$. This is slightly weaker than our
goal (5.2), which is $s_{\max}(A) = \sqrt{N} + O(\sqrt{n})$, but still satisfactory for most
applications. Results of the latter type will appear later in the more general model of
random matrices with independent rows or columns.

Similarly, our goal (5.2) for the smallest singular value is
$s_{\min}(A) \ge \sqrt{N} - O(\sqrt{n})$. Since the singular values are non-negative anyway, such
inequality would only be useful for sufficiently tall matrices, $N \gg n$. For almost square
and square matrices, estimating the smallest singular value (known also as the hard edge of
spectrum) is considerably more difficult. The progress on estimating the hard edge is
summarized in [69]. If $A$ has independent entries, then indeed
$s_{\min}(A) \ge c(\sqrt{N} - \sqrt{n})$, and the following is an optimal probability bound:

Theorem 5.38 (Independent entries, hard edge [68]). Let $A$ be an $N \times n$ random matrix
whose entries are independent identically distributed subgaussian random variables with
zero mean and unit variance. Then for $\varepsilon > 0$,

$$\mathbb{P}\big( s_{\min}(A) \le \varepsilon(\sqrt{N} - \sqrt{n - 1}) \big) \le (C\varepsilon)^{N - n + 1} + c^N$$

where $C > 0$ and $c \in (0, 1)$ depend only on the subgaussian norm of the entries.
This result gives an optimal bound for square matrices as well ($N = n$).



5.4 Random matrices with independent rows 

In this section, we focus on a more general model of random matrices, where we only 
assume independence of the rows rather than all entries. Such matrices are naturally 
generated by high-dimensional distributions. Indeed, given an arbitrary probability
distribution in $\mathbb{R}^n$, one takes a sample of $N$ independent points and arranges them as
the rows of an $N \times n$ matrix $A$. By studying spectral properties of $A$ one should be able
to learn something useful about the underlying distribution. For example, as we will see
in Section 5.4.3, the extreme singular values of $A$ would tell us whether the covariance
matrix of the distribution can be estimated from a sample of size $N$.

The picture will vary slightly depending on whether the rows of $A$ are sub-gaussian or
have arbitrary distribution. For heavy-tailed distributions, an extra logarithmic factor
has to appear in our desired inequality (5.2). The analysis of sub-gaussian and heavy-
tailed matrices will be completely different.

There is an abundance of examples where the results of this section may be useful.
They include all matrices with independent entries, whether sub-gaussian such as
Gaussian and Bernoulli, or completely general distributions with mean zero and unit
variance. In the latter case one is able to surpass the fourth moment assumption which
is necessary in Bai-Yin's law, Theorem 5.31.

Other examples of interest come from non-product distributions, some of which we
saw in Example 5.21. Sampling from discrete objects (matrices and frames) fits well in
this framework, too. Given a deterministic matrix $B$, one puts a uniform distribution
on the set of the rows of $B$ and creates a random matrix $A$ as before - by sampling some
$N$ random rows from $B$. Applications to sampling will be discussed in Section 5.4.4.



5.4.1 Sub-gaussian rows 

The following result goes in the direction of our goal (5.2) for random matrices with
independent sub-gaussian rows.

Theorem 5.39 (Sub-gaussian rows). Let $A$ be an $N \times n$ matrix whose rows $A_i$ are
independent sub-gaussian isotropic random vectors in $\mathbb{R}^n$. Then for every $t \ge 0$, with
probability at least $1 - 2\exp(-ct^2)$ one has

$$\sqrt{N} - C\sqrt{n} - t \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + C\sqrt{n} + t. \qquad (5.22)$$

Here $C = C_K$, $c = c_K > 0$ depend only on the subgaussian norm
$K = \max_i \|A_i\|_{\psi_2}$ of the rows.

This result is a general version of Corollary 5.35 (up to absolute constants); instead
of independent Gaussian entries we allow independent sub-gaussian rows. This of
course covers all matrices with independent sub-gaussian entries such as Gaussian and
Bernoulli. It also applies to some natural matrices whose entries are not independent.
One such example is a matrix whose rows are independent spherical random vectors
(Example 5.25).



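Proof. The proof is a basic version of a covering argument, and it has three steps. We
need to control $\|Ax\|_2$ for all vectors $x$ on the unit sphere $S^{n-1}$. To this end, we
discretize the sphere using a net $\mathcal{N}$ (the approximation step), establish a tight control
of $\|Ax\|_2$ for every fixed vector $x \in \mathcal{N}$ with high probability (the concentration
step), and finish off by taking a union bound over all $x$ in the net. The concentration step
will be based on the deviation inequality for sub-exponential random variables, Corollary 5.17.

Step 1: Approximation. Recalling Lemma 5.36 for the matrix $B = A/\sqrt{N}$ we see
that the conclusion of the theorem is equivalent to

$$\Big\| \frac{1}{N} A^*A - I \Big\| \le \max(\delta, \delta^2) =: \varepsilon \quad \text{where } \delta = C\sqrt{\frac{n}{N}} + \frac{t}{\sqrt{N}}. \qquad (5.23)$$

Using Lemma 5.4, we can evaluate the operator norm in (5.23) on a $\frac{1}{4}$-net $\mathcal{N}$ of
the unit sphere $S^{n-1}$:

$$\Big\| \frac{1}{N} A^*A - I \Big\| \le 2 \max_{x \in \mathcal{N}} \Big| \Big\langle \Big( \frac{1}{N} A^*A - I \Big) x, x \Big\rangle \Big| = 2 \max_{x \in \mathcal{N}} \Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big|.$$

So to complete the proof it suffices to show that, with the required probability,

$$\max_{x \in \mathcal{N}} \Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big| \le \frac{\varepsilon}{2}.$$

By Lemma 5.2, we can choose the net $\mathcal{N}$ so that it has cardinality $|\mathcal{N}| \le 9^n$.

Step 2: Concentration. Let us fix any vector $x \in S^{n-1}$. We can express $\|Ax\|_2^2$ as
a sum of independent random variables

$$\|Ax\|_2^2 = \sum_{i=1}^N \langle A_i, x \rangle^2 =: \sum_{i=1}^N Z_i^2 \qquad (5.24)$$

where $A_i$ denote the rows of the matrix $A$. By assumption, $Z_i = \langle A_i, x \rangle$ are
independent sub-gaussian random variables with $\mathbb{E}Z_i^2 = 1$ and $\|Z_i\|_{\psi_2} \le K$.
Therefore, by Remark 5.18 and Lemma 5.14, $Z_i^2 - 1$ are independent centered sub-exponential
random variables with
$\|Z_i^2 - 1\|_{\psi_1} \le 2\|Z_i^2\|_{\psi_1} \le 4\|Z_i\|_{\psi_2}^2 \le 4K^2$.

We can therefore use an exponential deviation inequality, Corollary 5.17, to control
the sum (5.24). Since $K \ge \|Z_i\|_{\psi_2} \ge \frac{1}{\sqrt{2}} (\mathbb{E}|Z_i|^2)^{1/2} = \frac{1}{\sqrt{2}}$,
this gives

$$\mathbb{P}\Big\{ \Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big| \ge \frac{\varepsilon}{2} \Big\} = \mathbb{P}\Big\{ \Big| \frac{1}{N} \sum_{i=1}^N Z_i^2 - 1 \Big| \ge \frac{\varepsilon}{2} \Big\} \le 2\exp\Big[ -\frac{c_1}{K^4} \min(\varepsilon^2, \varepsilon) N \Big]$$
$$= 2\exp\Big[ -\frac{c_1}{K^4} \delta^2 N \Big] \le 2\exp\Big[ -\frac{c_1}{K^4} (C^2 n + t^2) \Big]$$

where the last inequality follows by the definition of $\delta$ and using the inequality
$(a + b)^2 \ge a^2 + b^2$ for $a, b \ge 0$.

Step 3: Union bound. Taking the union bound over all vectors $x$ in the net $\mathcal{N}$ of
cardinality $|\mathcal{N}| \le 9^n$, we obtain

$$\mathbb{P}\Big\{ \max_{x \in \mathcal{N}} \Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big| \ge \frac{\varepsilon}{2} \Big\} \le 9^n \cdot 2\exp\Big[ -\frac{c_1}{K^4} (C^2 n + t^2) \Big] \le 2\exp\Big( -\frac{c_1 t^2}{K^4} \Big)$$

where the second inequality follows for $C = C_K$ sufficiently large, e.g.
$C = K^2 \sqrt{\ln 9 / c_1}$. As we noted in Step 1, this completes the proof of the theorem. □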

Remark 5.40 (Non-isotropic distributions). 1. A version of Theorem 5.39 holds for
general, non-isotropic sub-gaussian distributions. Assume that $A$ is an $N \times n$
matrix whose rows $A_i$ are independent sub-gaussian random vectors in $\mathbb{R}^n$ with
second moment matrix $\Sigma$. Then for every $t \ge 0$, the following inequality holds
with probability at least $1 - 2\exp(-ct^2)$:

$$\Big\| \frac{1}{N} A^*A - \Sigma \Big\| \le \max(\delta, \delta^2) \quad \text{where } \delta = C\sqrt{\frac{n}{N}} + \frac{t}{\sqrt{N}}. \qquad (5.25)$$

Here as before $C = C_K$, $c = c_K > 0$ depend only on the subgaussian norm
$K = \max_i \|A_i\|_{\psi_2}$ of the rows. This result is a general version of (5.23). It follows
by a straightforward modification of the argument of Theorem 5.39.

2. A more natural, multiplicative form of (5.25) is the following. Assume that
$\Sigma^{-1/2} A_i$ are isotropic sub-gaussian random vectors, and let $K$ be the maximum
of their sub-gaussian norms. Then for every $t \ge 0$, the following inequality holds
with probability at least $1 - 2\exp(-ct^2)$:

$$\Big\| \frac{1}{N} A^*A - \Sigma \Big\| \le \max(\delta, \delta^2)\, \|\Sigma\| \quad \text{where } \delta = C\sqrt{\frac{n}{N}} + \frac{t}{\sqrt{N}}. \qquad (5.26)$$

Here again $C = C_K$, $c = c_K > 0$. This result follows from Theorem 5.39 applied
to the isotropic random vectors $\Sigma^{-1/2} A_i$.



5.4.2 Heavy-tailed rows 

The class of sub-gaussian random variables in Theorem 5.39 may sometimes be too
restrictive in applications. For example, if the rows of $A$ are independent coordinate
or frame random vectors (Examples 5.21 and 5.25), they are poorly sub-gaussian and
Theorem 5.39 is too weak. In such cases, one would use the following result instead,
which operates in remarkable generality.

Theorem 5.41 (Heavy-tailed rows). Let $A$ be an $N \times n$ matrix whose rows $A_i$ are
independent isotropic random vectors in $\mathbb{R}^n$. Let $m$ be a number such that
$\|A_i\|_2 \le \sqrt{m}$ almost surely for all $i$. Then for every $t \ge 0$, one has

$$\sqrt{N} - t\sqrt{m} \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + t\sqrt{m} \qquad (5.27)$$

with probability at least $1 - 2n \cdot \exp(-ct^2)$, where $c > 0$ is an absolute constant.

Recall that $(\mathbb{E}\|A_i\|_2^2)^{1/2} = \sqrt{n}$ by Lemma 5.20. This indicates that one would
typically use Theorem 5.41 with $m = O(n)$. In this case the result takes the form

$$\sqrt{N} - t\sqrt{n} \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + t\sqrt{n} \qquad (5.28)$$

with probability at least $1 - 2n \cdot \exp(-c't^2)$. This is a form of our desired inequality
(5.2) for heavy-tailed matrices. We shall discuss this more after the proof.



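Proof. We shall use the non-commutative Bernstein's inequality, Theorem 5.29.

Step 1: Reduction to a sum of independent random matrices. We first note
that $m \ge n \ge 1$ since by Lemma 5.20 we have $\mathbb{E}\|A_i\|_2^2 = n$. Now we start an argument
parallel to Step 1 of Theorem 5.39. Recalling Lemma 5.36 for the matrix $B = A/\sqrt{N}$
we see that the desired inequalities (5.27) are equivalent to

$$\Big\| \frac{1}{N} A^*A - I \Big\| \le \max(\delta, \delta^2) =: \varepsilon \quad \text{where } \delta = t\sqrt{\frac{m}{N}}. \qquad (5.29)$$

We express this random matrix as a sum of independent random matrices:

$$\frac{1}{N} A^*A - I = \frac{1}{N} \sum_{i=1}^N A_i \otimes A_i - I = \sum_{i=1}^N X_i, \quad \text{where } X_i := \frac{1}{N} (A_i \otimes A_i - I);$$

note that $X_i$ are independent centered $n \times n$ random matrices.

Step 2: Estimating the mean, range and variance. We are going to apply the
non-commutative Bernstein inequality, Theorem 5.29, for the sum $\sum_i X_i$. Since $A_i$ are
isotropic random vectors, we have $\mathbb{E} A_i \otimes A_i = I$ which implies that
$\mathbb{E}X_i = 0$ as required in the non-commutative Bernstein inequality.

We estimate the range of $X_i$ using that $\|A_i\|_2 \le \sqrt{m}$ and $m \ge 1$:

$$\|X_i\| \le \frac{1}{N} (\|A_i \otimes A_i\| + 1) = \frac{1}{N} (\|A_i\|_2^2 + 1) \le \frac{1}{N} (m + 1) \le \frac{2m}{N} =: K.$$

To estimate the total variance $\|\sum_i \mathbb{E}X_i^2\|$, we first compute

$$X_i^2 = \frac{1}{N^2} \big[ (A_i \otimes A_i)^2 - 2(A_i \otimes A_i) + I \big],$$

so using the isotropy assumption $\mathbb{E} A_i \otimes A_i = I$ we obtain

$$\mathbb{E}X_i^2 = \frac{1}{N^2} \big[ \mathbb{E}(A_i \otimes A_i)^2 - I \big]. \qquad (5.30)$$

Since $(A_i \otimes A_i)^2 = \|A_i\|_2^2\, A_i \otimes A_i$ is a positive semi-definite matrix and
$\|A_i\|_2^2 \le m$ by assumption, we have
$\|\mathbb{E}(A_i \otimes A_i)^2\| \le m \cdot \|\mathbb{E} A_i \otimes A_i\| = m$. Putting this into (5.30)
we obtain

$$\|\mathbb{E}X_i^2\| \le \frac{1}{N^2} (m + 1) \le \frac{2m}{N^2}$$

where we again used that $m \ge 1$. This yields

$$\Big\| \sum_{i=1}^N \mathbb{E}X_i^2 \Big\| \le N \cdot \max_i \|\mathbb{E}X_i^2\| \le \frac{2m}{N} =: \sigma^2.$$

Step 3: Application of the non-commutative Bernstein's inequality. Applying
Theorem 5.29 (see Remark 5.30) and recalling the definitions of $\varepsilon$ and $\delta$ in (5.29),
we bound the probability in question as

$$\mathbb{P}\Big\{ \Big\| \frac{1}{N} A^*A - I \Big\| \ge \varepsilon \Big\} = \mathbb{P}\Big\{ \Big\| \sum_{i=1}^N X_i \Big\| \ge \varepsilon \Big\} \le 2n \cdot \exp\Big[ -c \min\Big( \frac{\varepsilon^2}{\sigma^2}, \frac{\varepsilon}{K} \Big) \Big]$$
$$\le 2n \cdot \exp\Big[ -c \min(\varepsilon^2, \varepsilon) \cdot \frac{N}{2m} \Big] = 2n \cdot \exp\Big( -\frac{c\delta^2 N}{2m} \Big) = 2n \cdot \exp(-ct^2/2).$$

This completes the proof. □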



Theorem 5.41 for heavy-tailed rows is different from Theorem 5.39 for sub-gaussian
rows in two ways: the boundedness assumption $\|A_i\|_2^2 \le m$ appears, and the probability
bound is weaker. We will now comment on both differences.

Remark 5.42 (Boundedness assumption). Observe that some boundedness assumption
on the distribution is needed in Theorem 5.41. Let us see this on the following example.
Choose $\delta > 0$ arbitrarily small, and consider a random vector $X = \delta^{-1/2} \xi Y$ in
$\mathbb{R}^n$ where $\xi$ is a $\{0, 1\}$-valued random variable with $\mathbb{E}\xi = \delta$ (a "selector")
and $Y$ is an independent isotropic random vector in $\mathbb{R}^n$ with an arbitrary distribution.
Then $X$ is also an isotropic random vector. Consider an $N \times n$ random matrix $A$ whose rows
$A_i$ are independent copies of $X$. However, if $\delta > 0$ is suitably small then $A = 0$ with
high probability, hence no nontrivial lower bound on $s_{\min}(A)$ is possible.
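To quantify the example in Remark 5.42: the matrix $A$ vanishes whenever all $N$ selectors vanish, an event of probability $(1 - \delta)^N$, which is close to $1$ as soon as $\delta$ is much smaller than $1/N$. The short sketch below evaluates this probability for a few values of $\delta$; the function name and the values of $N$ and $\delta$ are illustrative assumptions.

```python
def prob_all_selectors_vanish(delta, N):
    """P{xi_1 = ... = xi_N = 0} = (1 - delta)^N; on this event A = 0."""
    return (1.0 - delta) ** N

N = 1000
for delta in (1e-2, 1e-3, 1e-4, 1e-6):
    print(delta, prob_all_selectors_vanish(delta, N))
```

Isotropy is preserved for every $\delta$ because $\mathbb{E}(\delta^{-1/2}\xi)^2 = \delta^{-1}\,\mathbb{E}\xi = 1$, so only the boundedness of the rows is lost.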

Inequality (5.28) fits our goal (5.2), but not quite. The reason is that the probability
bound is only non-trivial if $t \ge C\sqrt{\log n}$. Therefore, in reality Theorem 5.41 asserts
that

$$\sqrt{N} - C\sqrt{n \log n} \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + C\sqrt{n \log n} \qquad (5.31)$$

with probability, say 0.9. This achieves our goal (5.2) up to a logarithmic factor.

Remark 5.43 (Logarithmic factor). The logarithmic factor can not be removed from
(5.31) for some heavy-tailed distributions. Consider for instance the coordinate
distribution introduced in Example 5.21. In order that $s_{\min}(A) > 0$ there must be no zero
columns in $A$. Equivalently, each coordinate vector $e_1, \ldots, e_n$ must be picked at least
once in $N$ independent trials (each row of $A$ picks an independent coordinate vector).
Recalling the classical coupon collector's problem, one must make at least $N \ge Cn \log n$
trials to make this occur with high probability. Thus the logarithm is necessary in the
left hand side of (5.31).
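The coupon collector effect in Remark 5.43 is easy to simulate, because for coordinate rows the matrix $A^*A$ is diagonal: $A^*A = n \cdot \mathrm{diag}(c_1, \ldots, c_n)$, where $c_j$ counts the rows equal to $\sqrt{n}\, e_j$, so the singular values are $\sqrt{n c_j}$. The sketch below compares $N = n$ rows with $N = 10\, n \log n$ rows. The function name, the seed, the dimension $n = 20$ and the factor $10$ are assumptions made for this illustration.

```python
import math
import random

def coordinate_singular_values(n, N, rng):
    """Singular values of an N x n matrix with rows drawn uniformly from {sqrt(n) e_j}.

    Here A*A = n * diag(counts), where counts[j] is the number of rows equal to
    sqrt(n) e_j, so the singular values are sqrt(n * counts[j]).
    """
    counts = [0] * n
    for _ in range(N):
        counts[rng.randrange(n)] += 1
    return sorted(math.sqrt(n * c) for c in counts)

rng = random.Random(2)
n = 20
few = coordinate_singular_values(n, n, rng)           # N = n rows
many_N = math.ceil(10 * n * math.log(n))              # N = 10 n log n rows
many = coordinate_singular_values(n, many_N, rng)

print("N = n:          s_min =", few[0])
print("N = 10 n log n: s_min / sqrt(N) =", many[0] / math.sqrt(many_N))
```

With only $N = n$ rows some coordinate is almost always missed, so the smallest singular value is zero; with the extra logarithmic factor all coordinates are hit and the smallest singular value is bounded away from zero.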

Footnote: Here the seemingly crude application of triangle inequality is actually not so loose. If the rows $A_i$ are identically distributed, then so are $X_i^2$, which makes the triangle inequality above into an equality.

Footnote: Going a little ahead, we would like to point out that the almost sure boundedness can be relaxed to the bound in expectation $\mathbb{E}\max_i \|A_i\|_2^2 \le m$, see Theorem 5.45.

Footnote: This argument moreover shows the optimality of the probability bound in Theorem 5.41. For example, for $t = \sqrt{N}/(2\sqrt{n})$ the conclusion (5.28) implies that $A$ is well conditioned (i.e. $\sqrt{N}/2 \le s_{\min}(A) \le s_{\max}(A) \le 2\sqrt{N}$) with probability $1 - n \cdot \exp(-cN/n)$. On the other hand, by the coupon collector's problem we estimate the probability that $s_{\min}(A) > 0$ as $1 - n \cdot (1 - \frac{1}{n})^N \approx 1 - n \cdot \exp(-N/n)$.

A version of Theorem 5.41 holds for general, non-isotropic distributions. It is convenient to state it in terms of the equivalent estimate (5.29):

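Theorem 5.44 (Heavy-tailed rows, non-isotropic). Let $A$ be an $N \times n$ matrix whose rows $A_i$ are independent random vectors in $\mathbb{R}^n$ with the common second moment matrix $\Sigma = \mathbb{E} A_i \otimes A_i$. Let $m$ be a number such that $\|A_i\|_2 \le \sqrt{m}$ almost surely for all $i$. Then for every $t \ge 0$, the following inequality holds with probability at least $1 - n \cdot \exp(-ct^2)$:

$$\Big\| \frac{1}{N} A^*A - \Sigma \Big\| \le \max\big( \|\Sigma\|^{1/2} \delta, \delta^2 \big) \quad \text{where } \delta = t\sqrt{\frac{m}{N}}. \tag{5.32}$$

Here $c > 0$ is an absolute constant. In particular, this inequality yields

$$\|A\| \le \|\Sigma\|^{1/2} \sqrt{N} + t\sqrt{m}. \tag{5.33}$$

Proof. We note that $m \ge \|\Sigma\|$ because $\|\Sigma\| = \|\mathbb{E} A_i \otimes A_i\| \le \mathbb{E}\|A_i \otimes A_i\| = \mathbb{E}\|A_i\|_2^2 \le m$. Then (5.32) follows by a straightforward modification of the argument of Theorem 5.41. Furthermore, if (5.32) holds then by triangle inequality

$$\frac{1}{N}\|A\|^2 = \Big\| \frac{1}{N} A^*A \Big\| \le \|\Sigma\| + \Big\| \frac{1}{N} A^*A - \Sigma \Big\| \le \|\Sigma\| + \|\Sigma\|^{1/2}\delta + \delta^2 \le \big( \|\Sigma\|^{1/2} + \delta \big)^2.$$

Taking square roots and multiplying both sides by $\sqrt{N}$, we obtain (5.33). □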



The almost sure boundedness requirement in Theorem 5.41 may sometimes be too restrictive in applications, and it can be relaxed to a bound in expectation:

Theorem 5.45 (Heavy-tailed rows; expected singular values). Let $A$ be an $N \times n$ matrix whose rows $A_i$ are independent isotropic random vectors in $\mathbb{R}^n$. Let $m := \mathbb{E}\max_{i \le N} \|A_i\|_2^2$. Then

$$\mathbb{E}\max_{j \le n} |s_j(A) - \sqrt{N}| \le C\sqrt{m \log\min(N, n)}$$

where $C$ is an absolute constant.

The proof of this result is similar to that of Theorem 5.41, except that this time we will use Rudelson's Corollary 5.28 instead of matrix Bernstein's inequality. To this end, we need a link to symmetric Bernoulli random variables. This is provided by a general symmetrization argument:

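Lemma 5.46 (Symmetrization). Let $(X_i)$ be a finite sequence of independent random vectors valued in some Banach space, and $(\varepsilon_i)$ be independent symmetric Bernoulli random variables. Then

$$\mathbb{E}\Big\| \sum_i (X_i - \mathbb{E}X_i) \Big\| \le 2\,\mathbb{E}\Big\| \sum_i \varepsilon_i X_i \Big\|. \tag{5.34}$$

Proof. We define random variables $\tilde{X}_i = X_i - X_i'$ where $(X_i')$ is an independent copy of the sequence $(X_i)$. Then $\tilde{X}_i$ are independent symmetric random variables, i.e. the sequence $(\tilde{X}_i)$ is distributed identically with $(-\tilde{X}_i)$ and thus also with $(\varepsilon_i \tilde{X}_i)$. Replacing $\mathbb{E}X_i$ by $\mathbb{E}X_i'$ in (5.34) and using Jensen's inequality, symmetry, and triangle inequality, we obtain the required inequality

$$\mathbb{E}\Big\| \sum_i (X_i - \mathbb{E}X_i) \Big\| \le \mathbb{E}\Big\| \sum_i \tilde{X}_i \Big\| = \mathbb{E}\Big\| \sum_i \varepsilon_i \tilde{X}_i \Big\| \le \mathbb{E}\Big\| \sum_i \varepsilon_i X_i \Big\| + \mathbb{E}\Big\| \sum_i \varepsilon_i X_i' \Big\| = 2\,\mathbb{E}\Big\| \sum_i \varepsilon_i X_i \Big\|.$$

□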



We will also need a probabilistic version of Lemma 5.36 on approximate isometries. The proof of that lemma was based on the elementary inequality $|z^2 - 1| \ge \max(|z - 1|, |z - 1|^2)$ for $z \ge 0$. Here is a probabilistic version:

Lemma 5.47. Let $Z$ be a non-negative random variable. Then $\mathbb{E}|Z^2 - 1| \ge \max\big(\mathbb{E}|Z - 1|, (\mathbb{E}|Z - 1|)^2\big)$.

Proof. Since $|Z - 1| \le |Z^2 - 1|$ pointwise, we have $\mathbb{E}|Z - 1| \le \mathbb{E}|Z^2 - 1|$. Next, since $|Z - 1|^2 \le |Z^2 - 1|$ pointwise, taking square roots and expectations we obtain $\mathbb{E}|Z - 1| \le \mathbb{E}|Z^2 - 1|^{1/2} \le (\mathbb{E}|Z^2 - 1|)^{1/2}$, where the last bound follows by Jensen's inequality. Squaring both sides completes the proof. □
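Both the elementary inequality and Lemma 5.47 are easy to sanity-check numerically. The sketch below is only an illustration: the grid of values of $z$ and the choice $Z = |g|$ with $g$ standard normal are assumptions made for the example.

```python
import numpy as np

# Pointwise inequality |z^2 - 1| >= max(|z - 1|, |z - 1|^2), checked on a grid of z >= 0.
z = np.linspace(0.0, 5.0, 100001)
lhs = np.abs(z**2 - 1)
rhs = np.maximum(np.abs(z - 1), np.abs(z - 1) ** 2)
pointwise_ok = bool(np.all(lhs >= rhs - 1e-12))

# Expectation version for an assumed non-negative Z: here Z = |g|, g standard normal.
rng = np.random.default_rng(0)
Z = np.abs(rng.standard_normal(200_000))
a = np.mean(np.abs(Z**2 - 1))   # empirical E|Z^2 - 1|
b = np.mean(np.abs(Z - 1))      # empirical E|Z - 1|
expectation_ok = bool(a >= max(b, b**2))
print(pointwise_ok, expectation_ok)
```

Since the empirical distribution of the sample is itself a probability distribution, the lemma applies to the sample averages directly, so the second check does not depend on Monte Carlo accuracy.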

Proof of Theorem 5.45. Step 1: Application of Rudelson's inequality. As in the proof of Theorem 5.41, we are going to control

$$E := \mathbb{E}\Big\| \frac{1}{N} A^*A - I \Big\| = \mathbb{E}\Big\| \frac{1}{N} \sum_{i=1}^N A_i \otimes A_i - I \Big\| \le \frac{2}{N}\,\mathbb{E}\Big\| \sum_{i=1}^N \varepsilon_i A_i \otimes A_i \Big\|$$

where we used Symmetrization Lemma 5.46 with independent symmetric Bernoulli random variables $\varepsilon_i$ (which are independent of $A$ as well). The expectation in the right hand side is taken both with respect to the random matrix $A$ and the signs $(\varepsilon_i)$. Taking first the expectation with respect to $(\varepsilon_i)$ (conditionally on $A$) and afterwards the expectation with respect to $A$, we obtain by Rudelson's inequality (Corollary 5.28) that

$$E \le \frac{C\sqrt{l}}{N}\,\mathbb{E}\Big( \max_{i \le N} \|A_i\|_2 \cdot \Big\| \sum_{i=1}^N A_i \otimes A_i \Big\|^{1/2} \Big)$$

where $l = \log\min(N, n)$. We now apply the Cauchy-Schwarz inequality. Since by the triangle inequality $\mathbb{E}\big\| \frac{1}{N} \sum_{i=1}^N A_i \otimes A_i \big\| = \mathbb{E}\big\| \frac{1}{N} A^*A \big\| \le E + 1$, it follows that

$$E \le C\sqrt{\frac{ml}{N}}\,(E + 1)^{1/2}.$$
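This inequality is easy to solve in $E$. Indeed, considering the cases $E \le 1$ and $E > 1$ separately, we conclude that

$$E = \mathbb{E}\Big\| \frac{1}{N} A^*A - I \Big\| \le \max(\delta, \delta^2) \quad \text{where } \delta := C\sqrt{\frac{2ml}{N}}.$$

Step 2: Diagonalization. Diagonalizing the matrix $A^*A$ one checks that

$$\Big\| \frac{1}{N} A^*A - I \Big\| = \max_{j \le n} \Big| \frac{s_j(A)^2}{N} - 1 \Big| = \max\Big( \Big| \frac{s_{\min}(A)^2}{N} - 1 \Big|, \Big| \frac{s_{\max}(A)^2}{N} - 1 \Big| \Big).$$

It follows that

$$\max\Big( \mathbb{E}\Big| \frac{s_{\min}(A)^2}{N} - 1 \Big|, \mathbb{E}\Big| \frac{s_{\max}(A)^2}{N} - 1 \Big| \Big) \le \max(\delta, \delta^2)$$

(we replaced the expectation of maximum by the maximum of expectations). Using Lemma 5.47 separately for the two terms on the left hand side, we obtain

$$\max\Big( \mathbb{E}\Big| \frac{s_{\min}(A)}{\sqrt{N}} - 1 \Big|, \mathbb{E}\Big| \frac{s_{\max}(A)}{\sqrt{N}} - 1 \Big| \Big) \le \delta.$$

Therefore

$$\mathbb{E}\max_{j \le n} \Big| \frac{s_j(A)}{\sqrt{N}} - 1 \Big| = \mathbb{E}\max\Big( \Big| \frac{s_{\min}(A)}{\sqrt{N}} - 1 \Big|, \Big| \frac{s_{\max}(A)}{\sqrt{N}} - 1 \Big| \Big) \le \mathbb{E}\Big( \Big| \frac{s_{\min}(A)}{\sqrt{N}} - 1 \Big| + \Big| \frac{s_{\max}(A)}{\sqrt{N}} - 1 \Big| \Big) \le 2\delta.$$

Multiplying both sides by $\sqrt{N}$ completes the proof. □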
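For completeness, the case analysis used in Step 1 to solve the inequality for $E$ can be written out as follows. This is a worked sketch; the shorthand $a$ is introduced here for convenience only and does not appear in the text.

```latex
% Shorthand (an assumption of this sketch): a := C\sqrt{ml/N},
% so the bound from Step 1 reads E \le a (E+1)^{1/2}, and \delta = \sqrt{2}\,a.
\begin{align*}
\text{Case } E \le 1:&\quad (E+1)^{1/2} \le \sqrt{2}
  \;\Longrightarrow\; E \le \sqrt{2}\,a = \delta, \\
\text{Case } E > 1:&\quad (E+1)^{1/2} \le (2E)^{1/2}
  \;\Longrightarrow\; E \le a\sqrt{2E}
  \;\Longrightarrow\; E \le 2a^2 = \delta^2 .
\end{align*}
% In either case E \le \max(\delta, \delta^2).
```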

In a way similar to Theorem 5.44, we note that a version of Theorem 5.45 holds for general, non-isotropic distributions.

Theorem 5.48 (Heavy-tailed rows, non-isotropic, expectation). Let $A$ be an $N \times n$ matrix whose rows $A_i$ are independent random vectors in $\mathbb{R}^n$ with the common second moment matrix $\Sigma = \mathbb{E} A_i \otimes A_i$. Let $m := \mathbb{E}\max_{i \le N} \|A_i\|_2^2$. Then

$$\mathbb{E}\Big\| \frac{1}{N} A^*A - \Sigma \Big\| \le \max\big( \|\Sigma\|^{1/2} \delta, \delta^2 \big) \quad \text{where } \delta = C\sqrt{\frac{m \log\min(N, n)}{N}}.$$

Here $C$ is an absolute constant. In particular, this inequality yields

$$\big( \mathbb{E}\|A\|^2 \big)^{1/2} \le \|\Sigma\|^{1/2} \sqrt{N} + C\sqrt{m \log\min(N, n)}.$$

Proof. The first part follows by a simple modification of the proof of Theorem 5.45. The second part follows from the first like in Theorem 5.44. □

Remark 5.49 (Non-identical second moments). The assumption that the rows $A_i$ have a common second moment matrix $\Sigma$ is not essential in Theorems 5.44 and 5.48. The reader will be able to formulate more general versions of these results. For example, if $A_i$ have arbitrary second moment matrices $\Sigma_i = \mathbb{E} A_i \otimes A_i$ then the conclusion of Theorem 5.48 holds with $\Sigma = \frac{1}{N} \sum_{i=1}^N \Sigma_i$.

5.4.3 Applications to estimating covariance matrices

One immediate application of our analysis of random matrices is in statistics, for the fundamental problem of estimating covariance matrices. Let $X$ be a random vector in $\mathbb{R}^n$; for simplicity we assume that $X$ is centered (see footnote), $\mathbb{E}X = 0$. Recall that the covariance matrix of $X$ is the $n \times n$ matrix $\Sigma = \mathbb{E} X \otimes X$, see Section 5.2.5.

The simplest way to estimate $\Sigma$ is to take some $N$ independent samples $X_i$ from the distribution and form the sample covariance matrix $\Sigma_N = \frac{1}{N} \sum_{i=1}^N X_i \otimes X_i$. By the law of large numbers, $\Sigma_N \to \Sigma$ almost surely as $N \to \infty$. So, taking sufficiently many samples we are guaranteed to estimate the covariance matrix as well as we want. This, however, does not address the quantitative aspect: what is the minimal sample size $N$ that guarantees approximation with a given accuracy?

The relation of this question to random matrix theory becomes clear when we arrange the samples $X_i =: A_i$ as rows of the $N \times n$ random matrix $A$. Then the sample covariance matrix is expressed as $\Sigma_N = \frac{1}{N} A^*A$. Note that $A$ is a matrix with independent rows but usually not independent entries (unless we sample from a product distribution). We worked out the analysis of such matrices in Section 5.4, separately for sub-gaussian and general distributions. As an immediate consequence of Theorem 5.39, we obtain:

Corollary 5.50 (Covariance estimation for sub-gaussian distributions). Consider a sub-gaussian distribution in $\mathbb{R}^n$ with covariance matrix $\Sigma$, and let $\varepsilon \in (0, 1)$, $t \ge 1$. Then with probability at least $1 - 2\exp(-t^2 n)$ one has

If $N \ge C(t/\varepsilon)^2 n$ then $\|\Sigma_N - \Sigma\| \le \varepsilon$.

Here $C = C_K$ depends only on the sub-gaussian norm $K = \|X\|_{\psi_2}$ of a random vector taken from this distribution.

Proof. It follows from (5.25) that for every $s \ge 0$, with probability at least $1 - 2\exp(-cs^2)$ we have $\|\Sigma_N - \Sigma\| \le \max(\delta, \delta^2)$ where $\delta = C\sqrt{n/N} + s/\sqrt{N}$. The conclusion follows for $s = C' t\sqrt{n}$ where $C' = C'_K$ is sufficiently large. □

Summarizing, Corollary 5.50 shows that the sample size

$$N = O(n)$$

suffices to approximate the covariance matrix of a sub-gaussian distribution in $\mathbb{R}^n$ by the sample covariance matrix.
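The decay of the estimation error as $N$ grows relative to $n$ can be observed in a small experiment. This is a minimal sketch under assumed choices: a centered Gaussian distribution, a diagonal test covariance, and the helper name `covariance_error` are all illustrative and not taken from the text.

```python
import numpy as np

def covariance_error(N, Sigma, rng):
    """Operator-norm error of the sample covariance built from N centered Gaussian samples."""
    n = Sigma.shape[0]
    # Rows of A are independent samples X_i ~ N(0, Sigma).
    A = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
    Sigma_N = A.T @ A / N          # (1/N) A^* A
    return np.linalg.norm(Sigma_N - Sigma, ord=2)

rng = np.random.default_rng(0)
n = 20
Sigma = np.diag(np.linspace(0.5, 1.0, n))   # an assumed test covariance
errors = {N: covariance_error(N, Sigma, rng) for N in (2 * n, 20 * n, 200 * n)}
print(errors)
```

One would expect the error to shrink roughly like $\sqrt{n/N}$ once $N$ is a large multiple of $n$, in line with the bound $\delta = C\sqrt{n/N} + s/\sqrt{N}$ used in the proof.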

Remark 5.51 (Multiplicative estimates, Gaussian distributions). A weak point of Corollary 5.50 is that the sub-gaussian norm $K$ may in turn depend on $\|\Sigma\|$.

To overcome this drawback, instead of using (5.25) in the proof of this result one can use the multiplicative version (5.26). The reader is encouraged to state a general result that follows from this argument. We just give one special example for arbitrary centered Gaussian distributions in $\mathbb{R}^n$. For every $\varepsilon \in (0, 1)$, $t \ge 1$, the following holds with probability at least $1 - 2\exp(-t^2 n)$:

If $N \ge C(t/\varepsilon)^2 n$ then $\|\Sigma_N - \Sigma\| \le \varepsilon \|\Sigma\|$.

Here $C$ is an absolute constant.

Footnote: More generally, in this section we estimate the second moment matrix $\mathbb{E} X \otimes X$ of an arbitrary random vector $X$ (not necessarily centered).



Finally, Theorem 5.44 yields a similar estimation result for arbitrary distributions, possibly heavy-tailed:

Corollary 5.52 (Covariance estimation for arbitrary distributions). Consider a distribution in $\mathbb{R}^n$ with covariance matrix $\Sigma$ and supported in some centered Euclidean ball whose radius we denote $\sqrt{m}$. Let $\varepsilon \in (0, 1)$ and $t \ge 1$. Then the following holds with probability at least $1 - n^{-t^2}$:

If $N \ge C(t/\varepsilon)^2 \|\Sigma\|^{-1} m \log n$ then $\|\Sigma_N - \Sigma\| \le \varepsilon \|\Sigma\|$.

Here $C$ is an absolute constant.

Proof. It follows from Theorem 5.44 that for every $s \ge 0$, with probability at least $1 - n \cdot \exp(-cs^2)$ we have $\|\Sigma_N - \Sigma\| \le \max(\|\Sigma\|^{1/2} \delta, \delta^2)$ where $\delta = s\sqrt{m/N}$. Therefore, if $N \ge (s/\varepsilon)^2 \|\Sigma\|^{-1} m$ then $\|\Sigma_N - \Sigma\| \le \varepsilon \|\Sigma\|$. The conclusion follows with $s = C' t\sqrt{\log n}$ where $C'$ is a sufficiently large absolute constant. □
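The two implications used in this proof can be spelled out as follows. This is a worked sketch; the explicit requirement $cC'^2 \ge 2$ is one sufficient choice assumed here, not a constant stated in the text.

```latex
% Assume N \ge (s/\varepsilon)^2 \|\Sigma\|^{-1} m, and recall \delta = s\sqrt{m/N}.
\begin{align*}
\delta^2 = \frac{s^2 m}{N} \le \varepsilon^2 \|\Sigma\|
  &\;\Longrightarrow\; \|\Sigma\|^{1/2}\delta \le \varepsilon\|\Sigma\|
  \quad\text{and}\quad \delta^2 \le \varepsilon^2\|\Sigma\| \le \varepsilon\|\Sigma\|
  \quad (\text{since } \varepsilon < 1), \\
  &\;\Longrightarrow\; \max\bigl(\|\Sigma\|^{1/2}\delta,\ \delta^2\bigr) \le \varepsilon\|\Sigma\| .
\end{align*}
% Probability: with s = C' t \sqrt{\log n}, and assuming cC'^2 \ge 2 and t \ge 1,
\begin{align*}
n \cdot \exp(-c s^2) = n^{\,1 - cC'^2 t^2} \le n^{\,1 - 2t^2} \le n^{-t^2}.
\end{align*}
```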

Corollary 5.52 is typically used with $m = O(\|\Sigma\| n)$. Indeed, if $X$ is a random vector chosen from the distribution in question, then its expected norm is easy to estimate: $\mathbb{E}\|X\|_2^2 = \operatorname{tr}(\Sigma) \le n\|\Sigma\|$. So, by Markov's inequality, most of the distribution is supported in a centered ball of radius $\sqrt{m}$ where $m = O(n\|\Sigma\|)$. If all of the distribution is supported there, i.e. if $\|X\|_2 = O(\sqrt{n\|\Sigma\|})$ almost surely, then the conclusion of Corollary 5.52 holds with sample size $N \ge C(t/\varepsilon)^2 n \log n$.

Remark 5.53 (Low-rank estimation). In certain applications, the distribution in $\mathbb{R}^n$ lies close to a low dimensional subspace. In this case, a smaller sample suffices for covariance estimation. The intrinsic dimension of the distribution can be measured with the effective rank of the matrix $\Sigma$, defined as

$$r(\Sigma) = \frac{\operatorname{tr}(\Sigma)}{\|\Sigma\|}.$$

One always has $r(\Sigma) \le \operatorname{rank}(\Sigma) \le n$, and this bound is sharp. For example, if $X$ is an isotropic random vector in $\mathbb{R}^n$ then $\Sigma = I$ and $r(\Sigma) = n$. A more interesting example is where $X$ takes values in some $r$-dimensional subspace $E$, and the restriction of the distribution of $X$ onto $E$ is isotropic. The latter means that $\Sigma = P_E$, where $P_E$ denotes the orthogonal projection in $\mathbb{R}^n$ onto $E$. Therefore in this case $r(\Sigma) = r$. The effective rank is a stable quantity compared with the usual rank. For distributions that are approximately low-dimensional, the effective rank is still small.

The effective rank $r = r(\Sigma)$ always controls the typical norm of $X$, as $\mathbb{E}\|X\|_2^2 = \operatorname{tr}(\Sigma) = r\|\Sigma\|$. It follows by Markov's inequality that most of the distribution is supported in a ball of radius $\sqrt{m}$ where $m = O(r\|\Sigma\|)$. Assume that all of the distribution is supported there, i.e. that $\|X\|_2 = O(\sqrt{r\|\Sigma\|})$ almost surely. Then the conclusion of Corollary 5.52 holds with sample size $N \ge C(t/\varepsilon)^2 r \log n$.
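The effective rank and its stability are easy to illustrate numerically. The sketch below assumes the definition $r(\Sigma) = \operatorname{tr}(\Sigma)/\|\Sigma\|$ with the operator norm; the function name and the test matrices are illustrative choices.

```python
import numpy as np

def effective_rank(Sigma):
    """r(Sigma) = tr(Sigma) / ||Sigma||, with ||.|| the operator (spectral) norm."""
    return np.trace(Sigma) / np.linalg.norm(Sigma, ord=2)

n, r = 10, 3
identity = np.eye(n)
# Orthogonal projection onto the span of the first r coordinate vectors.
P_E = np.diag([1.0] * r + [0.0] * (n - r))
# A small perturbation keeps the effective rank near r,
# although the usual rank jumps to n.
perturbed = P_E + 1e-3 * np.eye(n)

print(effective_rank(identity), effective_rank(P_E), effective_rank(perturbed))
print(np.linalg.matrix_rank(perturbed))
```

The perturbed matrix has full rank $n$ but an effective rank only slightly above $r$, which is the stability property mentioned in the remark.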

We can summarize this discussion in the following way: the sample size

$$N = O(n \log n)$$

suffices to approximate the covariance matrix of a general distribution in $\mathbb{R}^n$ by the sample covariance matrix. Furthermore, for distributions that are approximately low-dimensional, a smaller sample size is sufficient. Namely, if the effective rank of $\Sigma$ equals $r$ then a sufficient sample size is

$$N = O(r \log n).$$

Remark 5.54 (Boundedness assumption). Without the boundedness assumption on the distribution, Corollary 5.52 may fail. The reasoning is the same as in Remark 5.42: for an isotropic distribution which is highly concentrated at the origin, the sample covariance matrix will likely equal 0.

Still, one can weaken the boundedness assumption using Theorem 5.48 instead of Theorem 5.44 in the proof of Corollary 5.52. The weaker requirement is that $\mathbb{E}\max_{i \le N} \|X_i\|_2^2 \le m$ where $X_i$ denote the sample points. In this case, the covariance estimation will be guaranteed in expectation rather than with high probability; we leave the details for the interested reader.

A different way to enforce the boundedness assumption is to reject any sample points $X_i$ that fall outside the centered ball of radius $\sqrt{m}$. This is equivalent to sampling from the conditional distribution inside the ball. The conditional distribution satisfies the boundedness requirement, so the results discussed above provide a good covariance estimation for it. In many cases, this estimate works even for the original distribution - namely, if only a small part of the distribution lies outside the ball of radius $\sqrt{m}$. We leave the details for the interested reader; see e.g. [81].

5.4.4 Applications to random sub-matrices and sub-frames 

The absence of any moment hypotheses on the distribution in Section 5.4.2 (except
finite variance) makes these results especially relevant for discrete distributions. One 
such situation arises when one wishes to sample entries or rows from a given matrix B, 
thereby creating a random sub-matrix A. It is a big program to understand what we can 
learn about B by seeing A, see [331 ISS [SS]- In other words, we ask - what properties 
of B pass onto A? Here we shall only scratch the surface of this problem: we notice 
that random sub-matrices of certain size preserve the property of being an approximate 
isometry. 

Corollary 5.55 (Random sub-matrices). Consider an $M \times n$ matrix $B$ such that $s_{\min}(B) = s_{\max}(B) = \sqrt{M}$ (see footnote). Let $m$ be such that all rows $B_i$ of $B$ satisfy $\|B_i\|_2 \le \sqrt{m}$. Let $A$ be an $N \times n$ matrix obtained by sampling $N$ random rows from $B$ uniformly and independently. Then for every $t \ge 0$, with probability at least $1 - 2n \cdot \exp(-ct^2)$ one has

$$\sqrt{N} - t\sqrt{m} \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + t\sqrt{m}.$$

Here $c > 0$ is an absolute constant.

Footnote: The first hypothesis says $B^*B = MI$. Equivalently, $\bar{B} := \frac{1}{\sqrt{M}} B$ is an isometry, i.e. $\|\bar{B}x\|_2 = \|x\|_2$ for all $x$. Equivalently, the columns of $\bar{B}$ are orthonormal.

Proof. By assumption, $I = \frac{1}{M} B^*B = \frac{1}{M} \sum_{i=1}^M B_i \otimes B_i$. Therefore, the uniform distribution on the set of the rows $\{B_1, \ldots, B_M\}$ is an isotropic distribution in $\mathbb{R}^n$. The conclusion then follows from Theorem 5.41. □

Note that the conclusion of Corollary 5.55 does not depend on the dimension $M$ of the ambient matrix $B$. This happens because this result is a specific version of sampling from a discrete isotropic distribution (uniform on the rows of $B$), where the size $M$ of the support of the distribution is irrelevant.

The hypothesis of Corollary 5.55 implies (see footnote) that $\frac{1}{M} \sum_{i=1}^M \|B_i\|_2^2 = n$. Hence by Markov's inequality, most of the rows $B_i$ satisfy $\|B_i\|_2 = O(\sqrt{n})$. This indicates that Corollary 5.55 would be often used with $m = O(n)$. Also, to ensure a positive probability of success, the useful magnitude of $t$ would be $t \sim \sqrt{\log n}$. With this in mind, the extremal singular values of $A$ will be close to each other (and to $\sqrt{N}$) if $N \gg t^2 m \sim n \log n$.

Summarizing, Corollary 5.55 states that a random $O(n \log n) \times n$ sub-matrix of an $M \times n$ isometry is an approximate isometry (see footnote).
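A quick numerical illustration of Corollary 5.55 follows. It is a sketch under assumed choices: $B$ is taken to be the first $n$ columns of the unnormalized $M \times M$ discrete Fourier transform matrix (so $B^*B = MI$ and every row has norm exactly $\sqrt{n}$), and the sizes $M$, $n$, $N$ and the helper name `sample_rows` are illustrative.

```python
import numpy as np

def sample_rows(B, N, rng):
    """Form A by drawing N rows of B uniformly and independently (with replacement)."""
    idx = rng.integers(0, B.shape[0], size=N)
    return B[idx]

M, n = 4096, 8
# Assumed test matrix: entries exp(-2*pi*i*j*k/M) for j < M, k < n.
# Its columns are orthogonal with norm sqrt(M), and each row has norm sqrt(n).
B = np.exp(-2j * np.pi * np.outer(np.arange(M), np.arange(n)) / M)

rng = np.random.default_rng(0)
N = 1000
A = sample_rows(B, N, rng)
s = np.linalg.svd(A, compute_uv=False)
s_min_ratio, s_max_ratio = s.min() / np.sqrt(N), s.max() / np.sqrt(N)
print(s_min_ratio, s_max_ratio)
```

Both ratios should be close to 1, and repeating the experiment with a different $M$ should not change the picture, consistent with the remark that the conclusion does not depend on $M$.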

Another application of random matrices with heavy-tailed isotropic rows is for sampling from frames. Recall that frames are generalizations of bases without linear independence, see Example 5.21. Consider a tight frame $\{u_i\}_{i=1}^M$ in $\mathbb{R}^n$, and for the sake of convenient normalization, assume that it has bounds $A = B = M$. We are interested in whether a small random subset of $\{u_i\}_{i=1}^M$ is still a nice frame in $\mathbb{R}^n$. Such a question arises naturally because frames are used in signal processing to create redundant representations of signals. Indeed, every signal $x \in \mathbb{R}^n$ admits the frame expansion $x = \frac{1}{M} \sum_{i=1}^M \langle u_i, x \rangle u_i$. Redundancy makes frame representations more robust to errors and losses than basis representations. Indeed, we will show that if one loses all except $N = O(n \log n)$ random coefficients $\langle u_i, x \rangle$ one is still able to reconstruct $x$ from the received coefficients $\langle u_{i_k}, x \rangle$ as $x \approx \frac{1}{N} \sum_{k=1}^N \langle u_{i_k}, x \rangle u_{i_k}$. This boils down to showing that a random subset of size $N = O(n \log n)$ of a tight frame in $\mathbb{R}^n$ is an approximate tight frame.

Corollary 5.56 (Random sub-frames, see [80]). Consider a tight frame $\{u_i\}_{i=1}^M$ in $\mathbb{R}^n$ with frame bounds $A = B = M$. Let the number $m$ be such that all frame elements satisfy $\|u_i\|_2 \le \sqrt{m}$. Let $\{v_i\}_{i=1}^N$ be a set of vectors obtained by sampling $N$ random elements from the frame $\{u_i\}_{i=1}^M$ uniformly and independently. Let $\varepsilon \in (0, 1)$ and $t \ge 1$. Then the following holds with probability at least $1 - 2n^{-t^2}$:

If $N \ge C(t/\varepsilon)^2 m \log n$ then $\{v_i\}_{i=1}^N$ is a frame in $\mathbb{R}^n$ with bounds $A = (1 - \varepsilon)N$, $B = (1 + \varepsilon)N$. Here $C$ is an absolute constant.

In particular, if this event holds, then every $x \in \mathbb{R}^n$ admits an approximate representation using only the sampled frame elements:

$$\Big\| \frac{1}{N} \sum_{i=1}^N \langle v_i, x \rangle v_i - x \Big\| \le \varepsilon \|x\|.$$

Footnote: To recall why this is true, take trace of both sides in the identity $I = \frac{1}{M} \sum_{i=1}^M B_i \otimes B_i$.

Footnote: For the purposes of compressed sensing, we shall study the more difficult uniform problem for random sub-matrices in Section 5.6. There $B$ itself will be chosen as a column sub-matrix of a given $M \times M$ matrix (such as DFT), and one will need to control all such $B$ simultaneously, see Example 5.73.

Proof. The assumption implies that $I = \frac{1}{M} \sum_{i=1}^M u_i \otimes u_i$. Therefore, the uniform distribution on the set $\{u_i\}_{i=1}^M$ is an isotropic distribution in $\mathbb{R}^n$. Applying Corollary 5.52 with $\Sigma = I$ and $\Sigma_N = \frac{1}{N} \sum_{i=1}^N v_i \otimes v_i$ we conclude that $\|\Sigma_N - I\| \le \varepsilon$ with the required probability. This clearly completes the proof. □

As before, we note that $\frac{1}{M} \sum_{i=1}^M \|u_i\|_2^2 = n$, so Corollary 5.56 would be often used with $m = O(n)$. This shows, liberally speaking, that a random subset of a frame in $\mathbb{R}^n$ of size $N = O(n \log n)$ is again a frame.
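The approximate reconstruction from sampled frame elements can be tried on a toy example. The sketch below assumes a specific tight frame in the plane ($M$ equally spaced directions scaled by $\sqrt{2}$, so that $\sum_i u_i \otimes u_i = MI$ and $\|u_i\|_2^2 = n = 2$); the sizes and the test vector `x` are illustrative.

```python
import numpy as np

M, n = 1000, 2
angles = 2 * np.pi * np.arange(M) / M
# Assumed example frame: rows of U are u_i = sqrt(2) * (cos(theta_i), sin(theta_i)).
U = np.sqrt(2) * np.column_stack([np.cos(angles), np.sin(angles)])
frame_operator = U.T @ U                  # equals M * I for a tight frame with A = B = M

rng = np.random.default_rng(0)
N = 200
V = U[rng.integers(0, M, size=N)]         # sampled frame elements v_1, ..., v_N

x = np.array([0.3, -1.2])
x_hat = V.T @ (V @ x) / N                 # (1/N) * sum_i <v_i, x> v_i
rel_error = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(rel_error)
```

Here only $N = 200$ of the $M = 1000$ frame coefficients are used, drawn with replacement as in the corollary, and the relative reconstruction error should be small.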

Remark 5.57 (Non-uniform sampling). The boundedness assumption $\|u_i\|_2 \le \sqrt{m}$, although needed in Corollary 5.56, can be removed by non-uniform sampling. To this end, one would sample from the set of normalized vectors $\bar{u}_i := \sqrt{n}\, \frac{u_i}{\|u_i\|_2}$ with probabilities proportional to $\|u_i\|_2^2$. This defines an isotropic distribution in $\mathbb{R}^n$, and clearly $\|\bar{u}_i\|_2 = \sqrt{n}$. Therefore, by Corollary 5.56, a random sample of $N = O(n \log n)$ vectors obtained this way forms an almost tight frame in $\mathbb{R}^n$. This result does not require any bound on $\|u_i\|_2$.
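The isotropy claim in Remark 5.57 can be verified in one line. This is a worked sketch; the symbol $p_i$ for the sampling probabilities is notation introduced here, not in the text.

```latex
% Notation assumed for this sketch: p_i := \|u_i\|_2^2 / \sum_{j=1}^M \|u_j\|_2^2.
% Taking traces in \sum_{i} u_i \otimes u_i = M I gives \sum_{j} \|u_j\|_2^2 = M n,
% hence p_i = \|u_i\|_2^2 / (M n). Zero vectors u_i have p_i = 0 and are never sampled.
\begin{align*}
\sum_{i=1}^M p_i \, \bar{u}_i \otimes \bar{u}_i
  = \sum_{i=1}^M \frac{\|u_i\|_2^2}{M n} \cdot \frac{n \, u_i \otimes u_i}{\|u_i\|_2^2}
  = \frac{1}{M} \sum_{i=1}^M u_i \otimes u_i
  = I .
\end{align*}
```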



5.5 Random matrices with independent columns 

In this section we study the extreme singular values of $N \times n$ random matrices $A$ with independent columns $A_j$. We are guided by our ideal bounds (5.2) as before. The same phenomenon occurs in the column independent model as in the row independent model - sufficiently tall random matrices $A$ are approximate isometries. As before, being tall will mean $N \gg n$ for sub-gaussian distributions and $N \gg n \log n$ for arbitrary distributions.

The problem is equivalent to studying Gram matrices $G = A^*A = (\langle A_j, A_k \rangle)_{j,k=1}^n$ of independent isotropic random vectors $A_1, \ldots, A_n$ in $\mathbb{R}^N$. Our results can be interpreted using Lemma 5.36 as showing that the normalized Gram matrix $\frac{1}{N} G$ is an approximate identity for $N$, $n$ as above.

Let us first try to prove this with a heuristic argument. By Lemma 5.20 we know that the diagonal entries of $\frac{1}{N} G$ have mean $\frac{1}{N} \mathbb{E}\|A_j\|_2^2 = 1$ and off-diagonal ones have zero mean and standard deviation $\frac{1}{N} (\mathbb{E}\langle A_j, A_k \rangle^2)^{1/2} = 1/\sqrt{N}$. If, hypothetically, the off-diagonal entries were independent, then we could use the results on matrices with independent entries (or even rows) developed in Section 5.4. The off-diagonal part of $\frac{1}{N} G$ would have norm $O(\sqrt{n/N})$ while the diagonal part would approximately equal $I$. Hence we would have

$$\Big\| \frac{1}{N} G - I \Big\| = O\Big( \sqrt{\frac{n}{N}} \Big), \tag{5.35}$$

i.e. $\frac{1}{N} G$ is an approximate identity for $N \gg n$. Equivalently, by Lemma 5.36, (5.35) would yield the ideal bounds (5.2) on the extreme singular values of $A$.
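The heuristic prediction can be compared with a simulation. The sketch below assumes a particular column model (independent columns uniform on the sphere of radius $\sqrt{N}$, obtained by normalizing Gaussian vectors); the helper name `gram_deviation` and the sizes are illustrative.

```python
import numpy as np

def gram_deviation(N, n, rng):
    """|| G/N - I || for G = A^* A, where A has independent columns of norm sqrt(N)."""
    A = rng.standard_normal((N, n))
    # Normalize each column to norm sqrt(N): a spherical (hence isotropic) column model.
    A *= np.sqrt(N) / np.linalg.norm(A, axis=0)
    G = A.T @ A
    return np.linalg.norm(G / N - np.eye(n), ord=2)

rng = np.random.default_rng(0)
n = 20
deviations = {N: gram_deviation(N, n, rng) for N in (100, 1000, 10000)}
for N, dev in deviations.items():
    print(N, dev, np.sqrt(n / N))   # observed deviation next to sqrt(n/N)
```

The observed deviation should track $\sqrt{n/N}$ up to a moderate constant factor, in line with (5.35).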

Unfortunately, the entries of the Gram matrix $G$ are obviously not independent. To overcome this obstacle we shall use the decoupling technique of probability theory [22]. We observe that there is still enough independence encoded in $G$. Consider a principal sub-matrix $(A_S)^*(A_T)$ of $G = A^*A$ with disjoint index sets $S$ and $T$. If we condition on $(A_k)_{k \in T}$ then this sub-matrix has independent rows. Using an elementary decoupling technique, we will indeed seek to replace the full Gram matrix $G$ by one such decoupled $S \times T$ matrix with independent rows, and finish off by applying results of Section 5.4.

By transposition one can try to reduce our problem to studying the $n \times N$ matrix $A^*$. It has independent rows and the same singular values as $A$, so one can apply results of Section 5.4. The conclusion would be that, with high probability,

$$\sqrt{n} - C\sqrt{N} \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{n} + C\sqrt{N}.$$

Such an estimate is only good for flat matrices ($N \le n$). For tall matrices ($N \ge n$) the lower bound would be trivial because of the (possibly large) constant $C$. So, from now on we can focus on tall matrices ($N \ge n$) with independent columns.

5.5.1 Sub-gaussian columns 

Here we prove a version of Theorem 5.39 for matrices with independent columns.

Theorem 5.58 (Sub-gaussian columns). Let $A$ be an $N \times n$ matrix ($N \ge n$) whose columns $A_j$ are independent sub-gaussian isotropic random vectors in $\mathbb{R}^N$ with $\|A_j\|_2 = \sqrt{N}$ a.s. Then for every $t \ge 0$, the inequality

$$\sqrt{N} - C\sqrt{n} - t \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + C\sqrt{n} + t \tag{5.36}$$

holds with probability at least $1 - 2\exp(-ct^2)$, where $C = C'_K$, $c = c'_K > 0$ depend only on the sub-gaussian norm $K = \max_j \|A_j\|_{\psi_2}$ of the columns.

The only significant difference between Theorem 5.39 for independent rows and Theorem 5.58 for independent columns is that the latter requires normalization of columns, $\|A_j\|_2 = \sqrt{N}$ almost surely. Recall that by isotropy of $A_j$ (see Lemma 5.20) one always has $(\mathbb{E}\|A_j\|_2^2)^{1/2} = \sqrt{N}$, but the normalization is a bit stronger requirement. We will discuss this more after the proof of Theorem 5.58.

Remark 5.59 (Gram matrices are an approximate identity). By Lemma 5.36, the conclusion of Theorem 5.58 is equivalent to

$$\Big\| \frac{1}{N} A^*A - I \Big\| \le C\sqrt{\frac{n}{N}} + \frac{t}{\sqrt{N}}$$

with the same probability $1 - 2\exp(-ct^2)$. This establishes our ideal inequality (5.35). In words, the normalized Gram matrix of $n$ independent sub-gaussian isotropic random vectors in $\mathbb{R}^N$ is an approximate identity whenever $N \gg n$.

The proof of Theorem 5.58 is based on the decoupling technique [22]. What we will need here is an elementary decoupling lemma for double arrays. Its statement involves the notion of a random subset of a given finite set. To be specific, we define a random set $T$ of $[n]$ with a given average size $m \in [0, n]$ as follows. Consider independent $\{0, 1\}$-valued random variables $\delta_1, \ldots, \delta_n$ with $\mathbb{E}\delta_i = m/n$; these are sometimes called independent selectors. Then we define the random subset $T = \{i \in [n] : \delta_i = 1\}$. Its average size equals $\mathbb{E}|T| = \mathbb{E}\sum_{i=1}^n \delta_i = m$.

Lemma 5.60 (Decoupling). Consider a double array of real numbers $(a_{ij})_{i,j=1}^n$ such that $a_{ii} = 0$ for all $i$. Then

$$\sum_{i,j \in [n]} a_{ij} = 4\,\mathbb{E}\sum_{i \in T,\, j \in T^c} a_{ij}$$

where $T$ is a random subset of $[n]$ with average size $n/2$. In particular,

$$4 \min_{T \subseteq [n]} \sum_{i \in T,\, j \in T^c} a_{ij} \le \sum_{i,j \in [n]} a_{ij} \le 4 \max_{T \subseteq [n]} \sum_{i \in T,\, j \in T^c} a_{ij}$$

where the minimum and maximum are over all subsets $T$ of $[n]$.

Proof. Expressing the random subset as $T = \{i \in [n] : \delta_i = 1\}$ where $\delta_i$ are independent selectors with $\mathbb{E}\delta_i = 1/2$, we see that

$$\mathbb{E}\sum_{i \in T,\, j \in T^c} a_{ij} = \mathbb{E}\sum_{i,j \in [n]} \delta_i (1 - \delta_j) a_{ij} = \frac{1}{4} \sum_{i,j \in [n]} a_{ij},$$

where we used that $\mathbb{E}\delta_i (1 - \delta_j) = \frac{1}{4}$ for $i \ne j$ and the assumption $a_{ii} = 0$. This proves the first part of the lemma. The second part follows trivially by estimating expectation by maximum and minimum. □
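For small $n$ the identity in Lemma 5.60 can be checked exactly by enumerating all subsets $T$, since selectors with mean $1/2$ make $T$ uniform over the $2^n$ subsets of $[n]$. The following is a minimal sketch; the function name and the random test array are illustrative assumptions.

```python
import itertools
import numpy as np

def decoupled_average(a):
    """Average over all subsets T of [n] of sum_{i in T, j in T^c} a_ij.

    With independent selectors of mean 1/2, T is uniform over the 2^n subsets,
    so this average equals the expectation in Lemma 5.60.
    """
    n = a.shape[0]
    total = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        d = np.array(mask, dtype=float)   # d_i = 1 iff i is in T
        total += d @ a @ (1.0 - d)        # sum over i in T, j in T^c
    return total / 2**n

rng = np.random.default_rng(0)
n = 6
a = rng.standard_normal((n, n))
np.fill_diagonal(a, 0.0)                  # the lemma assumes a_ii = 0

lhs = a.sum()
rhs = 4 * decoupled_average(a)
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point rounding.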

Proof of Theorem 5.58. Step 1: Reductions. Without loss of generality we can assume that the columns $A_i$ have zero mean. Indeed, multiplying each column $A_i$ by $\pm 1$ arbitrarily preserves the extreme singular values of $A$, the isotropy of $A_i$ and the sub-gaussian norms of $A_i$. Therefore, by multiplying $A_i$ by independent symmetric Bernoulli random variables we achieve that $A_i$ have zero mean.

For $t = O(\sqrt{N})$ the conclusion of Theorem 5.58 follows from Theorem 5.39 by transposition. Indeed, the $n \times N$ random matrix $A^*$ has independent rows, so for $t \ge 0$ we have

$$s_{\max}(A) = s_{\max}(A^*) \le \sqrt{n} + C_K \sqrt{N} + t \tag{5.37}$$

with probability at least $1 - 2\exp(-c_K t^2)$. Here $c_K > 0$ and we can obviously assume that $C_K \ge 1$. For $t \ge C_K \sqrt{N}$ it follows that $s_{\max}(A) \le \sqrt{N} + \sqrt{n} + 2t$, which yields the conclusion of Theorem 5.58 (the left hand side of (5.36) being trivial). So, it suffices to prove the conclusion for $t \le C_K \sqrt{N}$. Let us fix such $t$.

It would be useful to have some a priori control of $s_{\max}(A) = \|A\|$. We thus consider the desired event

$$\mathcal{E} := \big\{ s_{\max}(A) \le 3 C_K \sqrt{N} \big\}.$$

Since $3 C_K \sqrt{N} \ge \sqrt{n} + C_K \sqrt{N} + t$, by (5.37) we see that $\mathcal{E}$ is likely to occur:

$$\mathbb{P}(\mathcal{E}^c) \le 2\exp(-c_K t^2). \tag{5.38}$$

Step 2: Approximation. This step is parallel to Step 1 in the proof of Theorem 5.39, except now we shall choose $\varepsilon := \delta$. This way we reduce our task to the following. Let $\mathcal{N}$ be a $\frac{1}{4}$-net of the unit sphere $S^{n-1}$ such that $|\mathcal{N}| \le 9^n$. It suffices to show that with probability at least $1 - 2\exp(-c'_K t^2)$ one has

$$\max_{x \in \mathcal{N}} \Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big| \le \frac{\delta}{2}, \quad \text{where } \delta = C\sqrt{\frac{n}{N}} + \frac{t}{\sqrt{N}}.$$

By (5.38), it is enough to show that the probability

$$p := \mathbb{P}\Big\{ \max_{x \in \mathcal{N}} \Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big| > \frac{\delta}{2} \text{ and } \mathcal{E} \Big\} \tag{5.39}$$

satisfies $p \le 2\exp(-c''_K t^2)$, where $c''_K > 0$ may depend only on $K$.

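Step 3: Decoupling. As in the proof of Theorem 5.39, we will obtain the required bound for a fixed $x \in \mathcal{N}$ with high probability, and then take a union bound over $x$. So let us fix any $x = (x_1, \ldots, x_n) \in S^{n-1}$. We expand

$$\|Ax\|_2^2 = \Big\| \sum_{j=1}^n x_j A_j \Big\|_2^2 = \sum_{j=1}^n x_j^2 \|A_j\|_2^2 + \sum_{j,k \in [n],\, j \ne k} x_j x_k \langle A_j, A_k \rangle. \tag{5.40}$$

Since $\|A_j\|_2^2 = N$ by assumption and $\|x\|_2 = 1$, the first sum equals $N$. Therefore, subtracting $N$ from both sides and dividing by $N$, we obtain the bound

$$\Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big| \le \Big| \frac{1}{N} \sum_{j,k \in [n],\, j \ne k} x_j x_k \langle A_j, A_k \rangle \Big|.$$

The sum in the right hand side is $\langle G_0 x, x \rangle$ where $G_0$ is the off-diagonal part of the Gram matrix $G = A^*A$. As we indicated in the beginning of Section 5.5, we are going to replace $G_0$ by its decoupled version whose rows and columns are indexed by disjoint sets. This is achieved by Decoupling Lemma 5.60: we obtain

$$\Big| \frac{1}{N} \|Ax\|_2^2 - 1 \Big| \le \frac{4}{N} \max_{T \subseteq [n]} |R_T(x)|, \quad \text{where } R_T(x) = \sum_{j \in T,\, k \in T^c} x_j x_k \langle A_j, A_k \rangle.$$

We substitute this into (5.39) and take the union bound over all choices of $x \in \mathcal{N}$ and $T \subseteq [n]$. As we know, $|\mathcal{N}| \le 9^n$, and there are $2^n$ subsets $T$ in $[n]$. This gives

$$p \le \mathbb{P}\Big\{ \max_{x \in \mathcal{N},\, T \subseteq [n]} |R_T(x)| > \frac{\delta N}{8} \text{ and } \mathcal{E} \Big\} \le 9^n \cdot 2^n \cdot \max_{x \in \mathcal{N},\, T \subseteq [n]} \mathbb{P}\Big\{ |R_T(x)| > \frac{\delta N}{8} \text{ and } \mathcal{E} \Big\}. \tag{5.41}$$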



Step 4: Conditioning and concentration. To estimate the probability in (5.41), we fix a vector $x \in \mathcal{N}$ and a subset $T \subseteq [n]$ and we condition on a realization of random vectors $(A_k)_{k \in T^c}$. We express

$$R_T(x) = \sum_{j \in T} x_j \langle A_j, z \rangle \quad \text{where } z = \sum_{k \in T^c} x_k A_k. \tag{5.42}$$

Under our conditioning $z$ is a fixed vector, so $R_T(x)$ is a sum of independent random variables. Moreover, if event $\mathcal{E}$ holds then $z$ is nicely bounded:

$$\|z\|_2 \le \|A\| \|x\|_2 \le 3 C_K \sqrt{N}. \tag{5.43}$$

If in turn (5.43) holds then the terms $\langle A_j, z \rangle$ in (5.42) are independent centered sub-gaussian random variables with $\|\langle A_j, z \rangle\|_{\psi_2} \le 3 K C_K \sqrt{N}$. By Lemma 5.9, their linear combination $R_T(x)$ is also a sub-gaussian random variable with

$$\|R_T(x)\|_{\psi_2} \le C_1 \Big( \sum_{j \in T} x_j^2 \|\langle A_j, z \rangle\|_{\psi_2}^2 \Big)^{1/2} \le \hat{C}_K \sqrt{N} \tag{5.44}$$

where $\hat{C}_K$ depends only on $K$.

We can summarize these observations as follows. Denoting the conditional probability by $\mathbb{P}_T = \mathbb{P}\{\, \cdot \mid (A_k)_{k \in T^c} \}$ and the expectation with respect to $(A_k)_{k \in T^c}$ by $\mathbb{E}_{T^c}$, we obtain by (5.43) and (5.44) that

$$\mathbb{P}\Big\{ |R_T(x)| > \frac{\delta N}{8} \text{ and } \mathcal{E} \Big\} \le \mathbb{E}_{T^c} \mathbb{P}_T \Big\{ |R_T(x)| > \frac{\delta N}{8} \text{ and } \|z\|_2 \le 3 C_K \sqrt{N} \Big\}$$

$$\le 2\exp\Big[ -c_1 \Big( \frac{\delta N / 8}{\hat{C}_K \sqrt{N}} \Big)^2 \Big] = 2\exp\Big( -\frac{c_2 \delta^2 N}{\hat{C}_K^2} \Big) \le 2\exp\Big( -\frac{c_2 C^2 n}{\hat{C}_K^2} - \frac{c_2 t^2}{\hat{C}_K^2} \Big).$$

The second inequality follows because $R_T(x)$ is a sub-gaussian random variable (5.44) whose tail decay is given by (5.10). Here $c_1, c_2 > 0$ are absolute constants. The last inequality follows from the definition of $\delta$. Substituting this into (5.41) and choosing $C$ sufficiently large (so that $\ln 36 \le c_2 C^2 / \hat{C}_K^2$), we conclude that

$$p \le 2\exp\big( -c_2 t^2 / \hat{C}_K^2 \big).$$

This proves the estimate that we desired in Step 2. The proof is complete. □

Remark 5.61 (Normalization assumption). Some a priori control of the norms of the columns $\|A_j\|_2$ is necessary for estimating the extreme singular values, since

$$s_{\min}(A) \le \min_{i \le n} \|A_i\|_2 \le \max_{i \le n} \|A_i\|_2 \le s_{\max}(A).$$

With this in mind, it is easy to construct an example showing that a normalization assumption $\|A_i\|_2 = \sqrt{N}$ is essential in Theorem 5.58; it cannot even be replaced by a boundedness assumption $\|A_i\|_2 = O(\sqrt{N})$.

Indeed, consider a random vector $X = \sqrt{2}\, \xi Y$ in $\mathbb{R}^N$ where $\xi$ is a $\{0, 1\}$-valued random variable with $\mathbb{E}\xi = 1/2$ (a "selector") and $Y$ is an independent spherical random vector in $\mathbb{R}^N$ (see Example 5.25). Let $A$ be a random matrix whose columns $A_j$ are independent copies of $X$. Then $A_j$ are independent centered sub-gaussian isotropic random vectors in $\mathbb{R}^N$ with $\|A_j\|_{\psi_2} = O(1)$ and $\|A_j\|_2 \le \sqrt{2N}$ a.s. So all assumptions of Theorem 5.58 except normalization are satisfied. On the other hand $\mathbb{P}\{X = 0\} = 1/2$, so the matrix $A$ has a zero column with overwhelming probability $1 - 2^{-n}$. This implies that $s_{\min}(A) = 0$ with this probability, so the lower estimate in (5.36) is false for all nontrivial $N, n, t$.

5.5.2 Heavy-tailed columns

Here we prove a version of Theorem 5.45 for independent heavy-tailed columns.

We thus consider $N \times n$ random matrices $A$ with independent columns $A_j$. In addition to the normalization assumption $\|A_j\|_2 = \sqrt{N}$ already present in Theorem 5.58 for sub-gaussian columns, our new result must also require an a priori control of the off-diagonal part of the Gram matrix $G = A^*A = (\langle A_j, A_k \rangle)_{j,k=1}^n$.

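Theorem 5.62 (Heavy-tailed columns). Let $A$ be an $N \times n$ matrix ($N \ge n$) whose columns $A_j$ are independent isotropic random vectors in $\mathbb{R}^N$ with $\|A_j\|_2 = \sqrt{N}$ a.s. Consider the incoherence parameter

$$m := \frac{1}{N}\, \mathbb{E}\max_{j \le n} \sum_{k \in [n],\, k \ne j} \langle A_j, A_k \rangle^2.$$

Then $\mathbb{E}\big\| \frac{1}{N} A^*A - I \big\| \le C_0 \sqrt{\frac{m \log n}{N}}$. In particular,

$$\mathbb{E}\max_{j \le n} |s_j(A) - \sqrt{N}| \le C\sqrt{m \log n}. \tag{5.45}$$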

Let us briefly clarify the role of the incoherence parameter $m$, which controls the lengths of the rows of the off-diagonal part of $G$. After the proof we will see that a control of $m$ is essential in Theorem 5.62. But for now, let us get a feel of the typical size of $m$. We have $\mathbb{E}\langle A_j, A_k \rangle^2 = N$ by Lemma 5.20, so for every row $j$ we see that $\frac{1}{N} \sum_{k \in [n],\, k \ne j} \mathbb{E}\langle A_j, A_k \rangle^2 = n - 1$. This indicates that Theorem 5.62 would be often used with $m = O(n)$.

In this case, Theorem 5.62 establishes our ideal inequality (5.35) up to a logarithmic factor. In words, the normalized Gram matrix of $n$ independent isotropic random vectors in $\mathbb{R}^N$ is an approximate identity whenever $N \gg n \log n$.
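To get a feel for the size of the incoherence parameter, one can estimate it by simulation. The sketch below assumes spherical columns (Gaussian vectors normalized to norm $\sqrt{N}$); the helper names, sizes and trial count are illustrative assumptions.

```python
import numpy as np

def incoherence(A):
    """(1/N) * max_j sum_{k != j} <A_j, A_k>^2 for a single realization of A."""
    N = A.shape[0]
    G = A.T @ A
    off = G - np.diag(np.diag(G))          # off-diagonal part of the Gram matrix
    return np.max(np.sum(off**2, axis=1)) / N

def estimate_m(N, n, trials=200, seed=0):
    """Monte Carlo estimate of the incoherence parameter m for spherical columns."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        A = rng.standard_normal((N, n))
        A *= np.sqrt(N) / np.linalg.norm(A, axis=0)   # columns of norm sqrt(N)
        vals.append(incoherence(A))
    return float(np.mean(vals))

N, n = 200, 20
m_hat = estimate_m(N, n)
print(m_hat, n)
```

For this model the estimate should exceed $n - 1$ (a maximum over rows is at least the average over rows) while staying within a small constant multiple of $n$, consistent with the regime $m = O(n)$.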

Our proof of Theorem 5.62 will be based on decoupling, symmetrization and an application of Theorem 5.48 for a decoupled Gram matrix with independent rows. The decoupling is done similarly to Theorem 5.58. However, this time we will benefit from formalizing the decoupling inequality for Gram matrices:

Lemma 5.63 (Matrix decoupling). Let B be a N xn random matrix whose columns Bj 
satisfy |l-Bj||2 = 1- Then 

m\B*B -I\\ < 4maxE||(BT)*BT<:||. 

TC[n] 

Proof. We first note that \\B*B-I\\ = sup^g5„-i | ||i3a;||2 - 1 1. We fix x = (xi, . . . ,x„) & 
S^~^ and, expanding as in (|5.40p . observe that 



n 

\\Bx\\l=Y.^]\\B,\\l+ ^jMB„B,}. 

j=l j,ke[n],j^k 
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The first sum equals 1 since ||-Bj||2 = \\x\\2 = 1- So by Decoupling Lemnia [5.601 a random 
subset T of [n] with average cardinality n/2 satisfies 

||Sa;||^ - 1 = 4Et Xj^k{Bj,Bk}. 
jeT.keT" 

Let us denote by Et and the expectations with respect to the random set T and the 
random matrix B respectively. Using Jensen's inequality we obtain 

Eb\\B*B-I\\=Eb sup |||Ba;||^-l| 

<4EbEt sup I XjXk{Bj,Bk)\=A'ETE.B\\{BTYBT4. 

The conclusion follows by replacing the expectation by the maximum over T. □ 
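The decoupling identity used in this proof can be checked exactly for small $n$ by averaging over all $2^n$ subsets $T$, which is the same as taking $\mathbb{E}_T$ when each index joins $T$ independently with probability $1/2$. The following sketch does this in plain Python; the dimensions, the seed and the helper names are illustrative assumptions.

```python
import math
import random

def inner(u, v):
    """Euclidean inner product of two lists."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Rescale a vector to unit Euclidean length."""
    s = math.sqrt(inner(v, v))
    return [a / s for a in v]

random.seed(1)
N, n = 5, 4
# Columns B_j of B, each a unit vector in R^N, and a unit vector x in R^n.
cols = [normalize([random.gauss(0, 1) for _ in range(N)]) for _ in range(n)]
x = normalize([random.gauss(0, 1) for _ in range(n)])

# Left side: ||Bx||_2^2 - 1.
Bx = [sum(x[j] * cols[j][i] for j in range(n)) for i in range(N)]
lhs = inner(Bx, Bx) - 1.0

# Right side: 4 * E_T sum_{j in T, k in T^c} x_j x_k <B_j, B_k>,
# with E_T computed as the average over all 2^n subsets T (encoded as bitmasks).
total = 0.0
for mask in range(2 ** n):
    T = [j for j in range(n) if (mask >> j) & 1]
    Tc = [k for k in range(n) if not (mask >> k) & 1]
    total += sum(x[j] * x[k] * inner(cols[j], cols[k]) for j in T for k in Tc)
rhs = 4.0 * total / 2 ** n

print(lhs, rhs)
```

Each ordered pair $j \ne k$ has $j \in T$, $k \in T^c$ for exactly a quarter of all subsets, which is where the factor 4 comes from.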

Proof of Theorem 5.62. Step 1: Reductions and decoupling. It would be useful to
have an a priori bound on $s_{\max}(A) = \|A\|$. We can obtain this by transposing $A$ and
applying one of the results of Section 5.4. Indeed, the random $n \times N$ matrix $A^*$ has
independent rows $A_j^*$ which by our assumption are normalized as $\|A_j^*\|_2 = \|A_j\|_2 = \sqrt{N}$.
Applying Theorem 5.45 with the roles of $n$ and $N$ switched, we obtain by the triangle
inequality that

    $\mathbb{E} \|A\| = \mathbb{E} \|A^*\| = \mathbb{E}\, s_{\max}(A^*) \le \sqrt{n} + C \sqrt{N \log n} \le C \sqrt{N \log n}.$    (5.46)

Observe that $n \lesssim m$ since by Lemma 5.20 we have $\frac{1}{N} \mathbb{E} \langle A_j, A_k \rangle^2 = 1$ for $j \ne k$.
We use Matrix Decoupling Lemma 5.63 for $B = \frac{1}{\sqrt{N}} A$ and obtain

    $E := \mathbb{E} \|\frac{1}{N} A^*A - I\| \le \frac{4}{N} \max_{T \subseteq [n]} \mathbb{E} \|(A_T)^* A_{T^c}\| = \frac{4}{N} \max_{T \subseteq [n]} \mathbb{E} \|\Gamma\|$    (5.47)

where $\Gamma = \Gamma(T)$ denotes the decoupled Gram matrix

    $\Gamma = (A_T)^* A_{T^c} = (\langle A_j, A_k \rangle)_{j \in T,\, k \in T^c}.$

Let us fix $T$; our problem then reduces to bounding the expected norm of $\Gamma$.

Step 2: The rows of the decoupled Gram matrix. For a subset $S \subseteq [n]$, we
denote by $\mathbb{E}_{A_S}$ the conditional expectation given $A_{S^c}$, i.e. with respect to $A_S = (A_j)_{j \in S}$.
Hence $\mathbb{E} = \mathbb{E}_{A_{T^c}} \mathbb{E}_{A_T}$.

Let us condition on $A_{T^c}$. Treating $(A_k)_{k \in T^c}$ as fixed vectors we see that, conditionally,
the random matrix $\Gamma$ has independent rows

    $\Gamma_j = (\langle A_j, A_k \rangle)_{k \in T^c}, \quad j \in T.$

So we are going to use Theorem 5.48 to bound the norm of $\Gamma$. To do this we need
estimates on (a) the second moment matrices and (b) the norms of the rows $\Gamma_j$.

(a) Since for $j \in T$, $\Gamma_j$ is a random vector valued in $\mathbb{R}^{T^c}$, we estimate its second
moment matrix by choosing $x \in \mathbb{R}^{T^c}$ and evaluating the scalar second moment

    $\mathbb{E}_{A_T} \langle \Gamma_j, x \rangle^2 = \mathbb{E}_{A_T} \Big( \sum_{k \in T^c} \langle A_j, A_k \rangle x_k \Big)^2 = \mathbb{E}_{A_T} \Big\langle A_j, \sum_{k \in T^c} x_k A_k \Big\rangle^2$
    $= \Big\| \sum_{k \in T^c} x_k A_k \Big\|_2^2 = \|A_{T^c} x\|_2^2 \le \|A_{T^c}\|^2 \|x\|_2^2.$

In the third equality we used isotropy of $A_j$. Taking maximum over all $j \in T$ and
$x \in \mathbb{R}^{T^c}$, we see that the second moment matrix $\Sigma(\Gamma_j) = \mathbb{E}_{A_T} \Gamma_j \otimes \Gamma_j$ satisfies

    $\max_{j \in T} \|\Sigma(\Gamma_j)\| \le \|A_{T^c}\|^2.$    (5.48)

(b) To evaluate the norms of $\Gamma_j$, $j \in T$, note that $\|\Gamma_j\|_2^2 = \sum_{k \in T^c} \langle A_j, A_k \rangle^2$. This is
easy to bound, because the assumption says that the random variable

    $M := \frac{1}{N} \max_{j \in [n]} \sum_{k \in [n],\, k \ne j} \langle A_j, A_k \rangle^2$ satisfies $\mathbb{E} M = m$.

This produces the bound $\mathbb{E} \max_{j \in T} \|\Gamma_j\|_2^2 \le N \cdot \mathbb{E} M = Nm$. But at this moment we
need to work conditionally on $A_{T^c}$, so for now we will be satisfied with

    $\mathbb{E}_{A_T} \max_{j \in T} \|\Gamma_j\|_2^2 \le N \cdot \mathbb{E}_{A_T} M.$    (5.49)

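Step 3: The norm of the decoupled Gram matrix. We bound the norm of the
random $T \times T^c$ Gram matrix $\Gamma$ with (conditionally) independent rows using Theorem 5.48
and Remark 5.49. Since by (5.48) we have $\big\| \frac{1}{|T|} \sum_{j \in T} \Sigma(\Gamma_j) \big\| \le \frac{1}{|T|} \sum_{j \in T} \|\Sigma(\Gamma_j)\| \le \|A_{T^c}\|^2$,
we obtain using (5.49) that

    $\mathbb{E}_{A_T} \|\Gamma\| \le (\mathbb{E}_{A_T} \|\Gamma\|^2)^{1/2} \le \|A_{T^c}\| \sqrt{|T|} + C \sqrt{N \cdot \mathbb{E}_{A_T}(M) \log |T^c|}$
    $\le \|A_{T^c}\| \sqrt{n} + C \sqrt{N \cdot \mathbb{E}_{A_T}(M) \log n}.$    (5.50)

Let us take expectation of both sides with respect to $A_{T^c}$. The left side becomes the
quantity we seek to bound, $\mathbb{E} \|\Gamma\|$. The right side will contain the term which we can
estimate by (5.46):

    $\mathbb{E}_{A_{T^c}} \|A_{T^c}\| = \mathbb{E} \|A_{T^c}\| \le \mathbb{E} \|A\| \le C \sqrt{N \log n}.$

The other term that will appear in the expectation of (5.50) is

    $\mathbb{E}_{A_{T^c}} \sqrt{\mathbb{E}_{A_T}(M)} \le \sqrt{\mathbb{E}_{A_{T^c}} \mathbb{E}_{A_T}(M)} \le \sqrt{\mathbb{E} M} = \sqrt{m}.$

So, taking the expectation in (5.50) and using these bounds, we obtain

    $\mathbb{E} \|\Gamma\| = \mathbb{E}_{A_{T^c}} \mathbb{E}_{A_T} \|\Gamma\| \le C \sqrt{N \log n} \sqrt{n} + C \sqrt{N m \log n} \le 2C \sqrt{N m \log n}$

where we used that $n \lesssim m$. Finally, using this estimate in (5.47) we conclude

    $\mathbb{E} \|\frac{1}{N} A^*A - I\| \le \frac{4}{N} \cdot 2C \sqrt{N m \log n} = 8C \sqrt{\frac{m \log n}{N}}.$

This establishes the first part of Theorem 5.62. The second part follows by the
diagonalization argument as in Step 2 of the proof of Theorem 5.45. □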

Remark 5.64 (Incoherence). A priori control on the incoherence is essential in
Theorem 5.62. Consider for instance an $N \times n$ random matrix $A$ whose columns are
independent coordinate random vectors in $\mathbb{R}^N$. Clearly $s_{\max}(A) \ge \max_j \|A_j\|_2 = \sqrt{N}$. On
the other hand, if the matrix is not too tall, $n \gg \sqrt{N}$, then $A$ has two identical columns
with high probability, which yields $s_{\min}(A) = 0$.
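This collision effect is easy to observe numerically. The sketch below, with illustrative dimensions $N = 100$, $n = 60$ chosen here as an assumption, builds a matrix whose columns are independent coordinate random vectors $\sqrt{N} e_i$ and inspects its extreme singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 60                        # N >= n, while n is well above sqrt(N) = 10
idx = rng.integers(0, N, size=n)      # column j picks a uniformly random coordinate idx[j]
A = np.zeros((N, n))
A[idx, np.arange(n)] = np.sqrt(N)     # column j equals sqrt(N) * e_{idx[j]}

s = np.linalg.svd(A, compute_uv=False)    # singular values in decreasing order
s_max, s_min = s[0], s[-1]
has_duplicate = len(set(idx.tolist())) < n

print("duplicate columns:", has_duplicate)
print("s_max =", s_max, " s_min =", s_min)
```

Whenever two columns coincide, the matrix is rank-deficient, so the smallest of its $n$ singular values vanishes even though every column has norm exactly $\sqrt{N}$.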

5.6 Restricted isometries 

In this section we consider an application of the non-asymptotic random matrix theory
in compressed sensing. For a thorough introduction to compressed sensing, see the
introductory chapter of this book and [28, 20].

In this area, $m \times n$ matrices $A$ are considered as measurement devices, taking as
input a signal $x \in \mathbb{R}^n$ and returning its measurement $y = Ax \in \mathbb{R}^m$. One would like
to take measurements economically, thus keeping $m$ as small as possible, and still to be
able to recover the signal $x$ from its measurement $y$.

The interesting regime for compressed sensing is where we take very few measurements,
$m \ll n$. Such matrices $A$ are not one-to-one, so recovery of $x$ from $y$ is not possible
for all signals $x$. But in practical applications, the amount of "information" contained
in the signal is often small. Mathematically this is expressed as sparsity of $x$. In the
simplest case, one assumes that $x$ has few non-zero coordinates, say $|\mathrm{supp}(x)| \le k \ll n$.
In this case, using any non-degenerate matrix $A$ one can check that $x$ can be recovered
whenever $m \ge 2k$ using the optimization problem $\min\{|\mathrm{supp}(x)| : Ax = y\}$.
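For very small instances the sparsest-solution problem can be solved by exhaustive search over supports. The sketch below is only an illustration: the function name `l0_recover`, the tolerance, and the choice of a Gaussian matrix as a "non-degenerate" $A$ are assumptions of this example, and the search cost grows combinatorially with $n$.

```python
import itertools
import numpy as np

def l0_recover(A, y, tol=1e-9):
    """Brute-force min{|supp(x)| : Ax = y}: try supports of growing size."""
    m, n = A.shape
    if np.linalg.norm(y) <= tol:
        return np.zeros(n)
    for size in range(1, n + 1):
        for T in itertools.combinations(range(n), size):
            cols = list(T)
            # Least-squares fit of y using only the columns indexed by T.
            coef, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
            if np.linalg.norm(A[:, cols] @ coef - y) <= tol:
                x = np.zeros(n)
                x[cols] = coef
                return x
    return None

rng = np.random.default_rng(0)
m, n, k = 4, 8, 2                      # m >= 2k
A = rng.standard_normal((m, n))        # any 2k columns are linearly independent a.s.
x_true = np.zeros(n)
x_true[[1, 5]] = [1.5, -2.0]           # a k-sparse signal
y = A @ x_true
x_hat = l0_recover(A, y)
print(np.round(x_hat, 6))
```

Recovery is unique here because two different $k$-sparse solutions would give a non-trivial linear dependence among at most $2k \le m$ columns of $A$.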

This optimization problem is highly non-convex and generally NP-hard. So instead
one considers a convex relaxation of this problem, $\min\{\|x\|_1 : Ax = y\}$. A basic
result in compressed sensing, due to Candes and Tao [T71 [TB] , is that for sparse signals
$|\mathrm{supp}(x)| \le k$, the convex problem recovers the signal $x$ from its measurement $y$ exactly,
provided that the measurement matrix $A$ is quantitatively non-degenerate. Precisely,
the non-degeneracy of $A$ means that it satisfies the following restricted isometry property
with $\delta_{2k}(A) \le 0.1$.

Definition (Restricted isometries). An $m \times n$ matrix $A$ satisfies the restricted isometry
property of order $k \ge 1$ if there exists $\delta_k \ge 0$ such that the inequality

    $(1 - \delta_k) \|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_k) \|x\|_2^2$    (5.51)

holds for all $x \in \mathbb{R}^n$ with $|\mathrm{supp}(x)| \le k$. The smallest number $\delta_k = \delta_k(A)$ is called the
restricted isometry constant of $A$.

In words, $A$ has a restricted isometry property if $A$ acts as an approximate isometry
on all sparse vectors. Clearly,

    $\delta_k(A) = \max_{|T| \le k} \|A_T^* A_T - I_{\mathbb{R}^T}\| = \max_{|T| = \lfloor k \rfloor} \|A_T^* A_T - I_{\mathbb{R}^T}\|$    (5.52)

where the maximum is over all subsets $T \subseteq [n]$ with $|T| \le k$ or $|T| = \lfloor k \rfloor$.

The concept of restricted isometry can also be expressed via extreme singular values,
which brings us to the topic we studied in the previous sections. $A$ is a restricted isometry
if and only if all $m \times k$ sub-matrices $A_T$ of $A$ (obtained by selecting arbitrary $k$ columns
from $A$) are approximate isometries. Indeed, for every $\delta \ge 0$, Lemma 5.36 shows that
the following two inequalities are equivalent up to an absolute constant:

    $\delta_k(A) \le \max(\delta, \delta^2);$    (5.53)
    $1 - \delta \le s_{\min}(A_T) \le s_{\max}(A_T) \le 1 + \delta$ for all $|T| \le k$.    (5.54)

More precisely, (5.53) implies (5.54), and (5.54) implies $\delta_k(A) \le 3 \max(\delta, \delta^2)$.
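For small matrices, the restricted isometry constant can be evaluated directly from (5.52) by enumerating all column subsets of size $k$. The sketch below does so and also reports the extreme singular values over all $m \times k$ sub-matrices; the function names and the Gaussian test matrix are assumptions of this example, and the enumeration is only feasible for small $n$ and $k$.

```python
import itertools
import numpy as np

def restricted_isometry_constant(A, k):
    """delta_k(A) = max over |T| = k of ||A_T^* A_T - I||, following (5.52)."""
    n = A.shape[1]
    delta = 0.0
    for T in itertools.combinations(range(n), k):
        AT = A[:, list(T)]
        G = AT.conj().T @ AT
        delta = max(delta, np.linalg.norm(G - np.eye(k), ord=2))
    return delta

def extreme_singular_values(A, k):
    """Smallest s_min and largest s_max over all m x k column sub-matrices."""
    n = A.shape[1]
    lo, hi = np.inf, 0.0
    for T in itertools.combinations(range(n), k):
        s = np.linalg.svd(A[:, list(T)], compute_uv=False)
        lo, hi = min(lo, s[-1]), max(hi, s[0])
    return lo, hi

rng = np.random.default_rng(0)
m, n, k = 40, 10, 3
A_bar = rng.standard_normal((m, n)) / np.sqrt(m)   # normalized Gaussian matrix
delta_k = restricted_isometry_constant(A_bar, k)
s_lo, s_hi = extreme_singular_values(A_bar, k)
print(f"delta_k = {delta_k:.3f}, s_min = {s_lo:.3f}, s_max = {s_hi:.3f}")
```

Since $\|A_T^* A_T - I\| = \max(|s_{\max}(A_T)^2 - 1|, |s_{\min}(A_T)^2 - 1|)$, the two computations agree through $\delta_k = \max(|s_{\mathrm{hi}}^2 - 1|, |s_{\mathrm{lo}}^2 - 1|)$.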

Our goal is thus to find matrices that are good restricted isometries. What good
means is clear from the goals of compressed sensing described above. First, we need to
keep the restricted isometry constant $\delta_k(A)$ below some small absolute constant, say 0.1.
Most importantly, we would like the number of measurements $m$ to be small, ideally
proportional to the sparsity $k \ll n$.

This is where non-asymptotic random matrix theory enters. We shall indeed show
that, with high probability, $m \times n$ random matrices $A$ are good restricted isometries of
order $k$ with $m = O^*(k)$. Here the $O^*$ notation hides some logarithmic factors of $n$.
Specifically, in Theorem 5.65 we will show that

    $m = O(k \log(n/k))$

for sub-gaussian random matrices $A$ (with independent rows or columns). This is due to
the strong concentration properties of such matrices. A general observation of this kind
is Proposition 5.66. It says that if for a given $x$, a random matrix $A$ (taken from any
distribution) satisfies inequality (5.51) with high probability, then $A$ is a good restricted
isometry.

In Theorem 5.71 we will extend these results to random matrices without concentration
properties. Using a uniform extension of Rudelson's inequality, Corollary 5.28, we
shall show that

    $m = O(k \log^4 n)$    (5.55)

for heavy-tailed random matrices $A$ (with independent rows). This includes the important
example of random Fourier matrices.

5.6.1 Sub-gaussian restricted isometries 

In this section we show that $m \times n$ sub-gaussian random matrices $A$ are good restricted
isometries. We have in mind either of the following two models, which we analyzed in
Sections 5.4.1 and 5.5.1 respectively:

Row-independent model: the rows of $A$ are independent sub-gaussian isotropic random
vectors in $\mathbb{R}^n$;

Column-independent model: the columns $A_i$ of $A$ are independent sub-gaussian
isotropic random vectors in $\mathbb{R}^m$ with $\|A_i\|_2 = \sqrt{m}$ a.s.

Recall that these models cover many natural examples, including Gaussian and Bernoulli
matrices (whose entries are independent standard normal or symmetric Bernoulli random
variables), general sub-gaussian random matrices (whose entries are independent
sub-gaussian random variables with mean zero and unit variance), "column spherical"
matrices whose columns are independent vectors uniformly distributed on the centered
Euclidean sphere in $\mathbb{R}^m$ with radius $\sqrt{m}$, "row spherical" matrices whose rows are
independent vectors uniformly distributed on the centered Euclidean sphere in $\mathbb{R}^n$ with
radius $\sqrt{n}$, etc.

Theorem 5.65 (Sub-gaussian restricted isometries). Let $A$ be an $m \times n$ sub-gaussian
random matrix with independent rows or columns, which follows either of the two models
above. Then the normalized matrix $\bar{A} = \frac{1}{\sqrt{m}} A$ satisfies the following for every sparsity
level $1 \le k \le n$ and every number $\delta \in (0, 1)$:

    if $m \ge C \delta^{-2} k \log(en/k)$ then $\delta_k(\bar{A}) \le \delta$

with probability at least $1 - 2 \exp(-c \delta^2 m)$. Here $C = C_K$, $c = c_K > 0$ depend only on
the sub-gaussian norm $K = \max_i \|A_i\|_{\psi_2}$ of the rows or columns of $A$.

Proof. Let us check that the conclusion follows from Theorem 5.39 for the row-independent
model, and from Theorem 5.58 for the column-independent model. We shall control the
restricted isometry constant using its equivalent description (5.52). We can clearly
assume that $k$ is a positive integer.

Let us fix a subset $T \subseteq [n]$, $|T| = k$ and consider the $m \times k$ random matrix $A_T$. If
$A$ follows the row-independent model, then the rows of $A_T$ are orthogonal projections
of the rows of $A$ onto $\mathbb{R}^T$, so they are still independent sub-gaussian isotropic random
vectors in $\mathbb{R}^T$. If alternatively, $A$ follows the column-independent model, then trivially
the columns of $A_T$ satisfy the same assumptions as the columns of $A$. In either case,
Theorem 5.39 or Theorem 5.58 applies to $A_T$. Hence for every $s \ge 0$, with probability
at least $1 - 2 \exp(-c s^2)$ one has

    $\sqrt{m} - C_0 \sqrt{k} - s \le s_{\min}(A_T) \le s_{\max}(A_T) \le \sqrt{m} + C_0 \sqrt{k} + s.$    (5.56)

Using Lemma 5.36 for $\bar{A}_T = \frac{1}{\sqrt{m}} A_T$, we see that (5.56) implies that

    $\|\bar{A}_T^* \bar{A}_T - I_{\mathbb{R}^T}\| \le 3 \max(\delta_0, \delta_0^2)$ where $\delta_0 = C_0 \sqrt{\frac{k}{m}} + \frac{s}{\sqrt{m}}$.

Now we take a union bound over all subsets $T \subseteq [n]$, $|T| = k$. Since there are
$\binom{n}{k} \le (en/k)^k$ ways to choose $T$, we conclude that

    $\max_{|T| = k} \|\bar{A}_T^* \bar{A}_T - I_{\mathbb{R}^T}\| \le 3 \max(\delta_0, \delta_0^2)$

with probability at least $1 - \binom{n}{k} \cdot 2 \exp(-c s^2) \ge 1 - 2 \exp(k \log(en/k) - c s^2)$. Then,
once we choose $\varepsilon > 0$ arbitrarily and let $s = C_1 \sqrt{k \log(en/k)} + \varepsilon \sqrt{m}$, we conclude with
probability at least $1 - 2 \exp(-c \varepsilon^2 m)$ that

    $\delta_k(\bar{A}) \le 3 \max(\delta_0, \delta_0^2)$ where $\delta_0 = C_0 \sqrt{\frac{k}{m}} + C_1 \sqrt{\frac{k \log(en/k)}{m}} + \varepsilon$.

Finally, we apply this statement for $\varepsilon := \delta/6$. By choosing the constant $C$ in the
statement of the theorem sufficiently large, we make $m$ large enough so that $\delta_0 \le \delta/3$, which
yields $3 \max(\delta_0, \delta_0^2) \le \delta$. The proof is complete. □

The main reason Theorem 5.65 holds is that the random matrix $A$ has a strong
concentration property, i.e. that $\|\bar{A} x\|_2 \approx \|x\|_2$ with high probability for every fixed sparse
vector $x$. This concentration property alone implies the restricted isometry property,
regardless of the specific random matrix model:

Proposition 5.66 (Concentration implies restricted isometry, see [IH]). Let $A$ be an
$m \times n$ random matrix, and let $k \ge 1$, $\delta \ge 0$, $\varepsilon > 0$. Assume that for every fixed $x \in \mathbb{R}^n$,
$|\mathrm{supp}(x)| \le k$, the inequality

    $(1 - \delta) \|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta) \|x\|_2^2$

holds with probability at least $1 - \exp(-\varepsilon m)$. Then we have the following:

    if $m \ge C \varepsilon^{-1} k \log(en/k)$ then $\delta_k(A) \le 2\delta$

with probability at least $1 - \exp(-\varepsilon m / 2)$. Here $C$ is an absolute constant.

In words, the restricted isometry property can be checked on each individual vector
$x$ with high probability.

Proof. We shall use the expression (5.52) to estimate the restricted isometry constant.
We can clearly assume that $k$ is an integer, and focus on the sets $T \subseteq [n]$, $|T| = k$.
By Lemma 5.2, we can find a net $\mathcal{N}_T$ of the unit sphere $S^{n-1} \cap \mathbb{R}^T$ with cardinality
$|\mathcal{N}_T| \le 9^k$. By Lemma 5.4, we estimate the operator norm as

    $\|A_T^* A_T - I_{\mathbb{R}^T}\| \le 2 \max_{x \in \mathcal{N}_T} |\langle (A_T^* A_T - I_{\mathbb{R}^T}) x, x \rangle| = 2 \max_{x \in \mathcal{N}_T} \big| \|Ax\|_2^2 - 1 \big|.$

Taking maximum over all subsets $T \subseteq [n]$, $|T| = k$, we conclude that

    $\delta_k(A) \le 2 \max_{|T| = k} \max_{x \in \mathcal{N}_T} \big| \|Ax\|_2^2 - 1 \big|.$

On the other hand, by assumption we have for every $x \in \mathcal{N}_T$ that

    $\mathbb{P}\{ | \|Ax\|_2^2 - 1 | > \delta \} \le \exp(-\varepsilon m).$

Therefore, taking a union bound over $\binom{n}{k} \le (en/k)^k$ choices of the set $T$ and over $9^k$
elements $x \in \mathcal{N}_T$, we obtain that

    $\mathbb{P}\{\delta_k(A) > 2\delta\} \le \binom{n}{k} 9^k \exp(-\varepsilon m) \le \exp(k \ln(en/k) + k \ln 9 - \varepsilon m)$
    $\le \exp(-\varepsilon m / 2)$

where the last line follows by the assumption on $m$. The proof is complete. □
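The arithmetic behind the last step can be made explicit. Since $\ln(en/k) \ge 1$ for $k \le n$, we have $k \ln(en/k) + k \ln 9 \le (1 + \ln 9)\, k \ln(en/k)$, so any $m \ge 2(1 + \ln 9)\, \varepsilon^{-1} k \ln(en/k)$ makes the exponent at most $-\varepsilon m / 2$. The sketch below checks this numerically; the specific constant $C = 2(1 + \ln 9)$ and the sample values of $n$, $k$, $\varepsilon$ are choices made for this illustration, not values taken from the text.

```python
import math

def failure_exponent(n, k, m, eps):
    """Exponent k*ln(en/k) + k*ln(9) - eps*m appearing in the union bound."""
    return k * math.log(math.e * n / k) + k * math.log(9) - eps * m

def sufficient_m(n, k, eps):
    """Smallest integer m >= C * k * ln(en/k) / eps for the choice C = 2*(1 + ln 9)."""
    C = 2.0 * (1.0 + math.log(9))
    return math.ceil(C * k * math.log(math.e * n / k) / eps)

n, k, eps = 10_000, 50, 0.01
m = sufficient_m(n, k, eps)
exponent = failure_exponent(n, k, m, eps)
print("m =", m, " exponent =", exponent, " target =", -eps * m / 2)

# The binomial estimate used in the union bound: C(n, k) <= (en/k)^k, compared in logarithms.
log_binom = math.log(math.comb(n, k))
log_bound = k * math.log(math.e * n / k)
print("ln C(n,k) =", log_binom, " k ln(en/k) =", log_bound)
```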






5.6.2 Heavy-tailed restricted isometries 

In this section we show that $m \times n$ random matrices $A$ with independent heavy-tailed
rows (and uniformly bounded coefficients) are good restricted isometries. This result will
be established in Theorem 5.71. As before, we will prove this by controlling the extreme
singular values of all $m \times k$ sub-matrices $A_T$. For each individual subset $T$, this can be
achieved using Theorem 5.41: one has

    $\sqrt{m} - t \sqrt{k} \le s_{\min}(A_T) \le s_{\max}(A_T) \le \sqrt{m} + t \sqrt{k}$    (5.57)

with probability at least $1 - 2k \cdot \exp(-c t^2)$. Although this probability estimate
has optimal order, it is too weak to allow for a union bound over all $\binom{n}{k} = (O(1) n/k)^k$
choices of the subset $T$. Indeed, in order that $1 - \binom{n}{k} 2k \cdot \exp(-c t^2) > 0$ one would need
to take $t > \sqrt{k \log(n/k)}$. So in order to achieve a nontrivial lower bound in (5.57), one
would be forced to take $m \ge k^2$. This is too many measurements; recall that our hope
is $m = O^*(k)$.
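The following short computation illustrates how the naive union bound forces $m$ to grow quadratically in $k$. It is only a sketch: the unknown absolute constant $c$ in the probability estimate is set to 1 as a placeholder, and the helper names are introduced for this example.

```python
import math

def min_t_squared(n, k, c=1.0):
    """Smallest t^2 with C(n,k) * 2k * exp(-c t^2) < 1 (c = 1 is a placeholder constant)."""
    return (math.log(math.comb(n, k)) + math.log(2 * k)) / c

def min_measurements(n, k, c=1.0):
    """The lower bound sqrt(m) - t*sqrt(k) > 0 in (5.57) needs m > t^2 * k."""
    return min_t_squared(n, k, c) * k

n = 10_000
rows = []
for k in (10, 50, 200):
    m_needed = min_measurements(n, k)
    rows.append((k, m_needed, m_needed / k ** 2))
    print(f"k = {k:4d}: m > {m_needed:10.0f}, m / k^2 = {m_needed / k ** 2:.2f}")
```

Because $\binom{n}{k} \ge (n/k)^k$, the ratio $m / k^2$ stays above $\ln(n/k)$, far from the hoped-for $m = O^*(k)$.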

This observation suggests that instead of controlling each sub-matrix $A_T$ separately,
we should learn how to control all $A_T$ at once. This is indeed possible with the following
uniform version of Theorem 5.45.

Theorem 5.67 (Heavy-tailed rows; uniform). Let $A = (a_{ij})$ be an $N \times d$ matrix
($1 < N \le d$) whose rows $A_i$ are independent isotropic random vectors in $\mathbb{R}^d$. Let $K$ be a
number such that all entries $|a_{ij}| \le K$ almost surely. Then for every $1 < n \le d$, we have

    $\mathbb{E} \max_{|T| \le n} \max_{j \le |T|} |s_j(A_T) - \sqrt{N}| \le C l \sqrt{n}$

where $l = \log(n) \sqrt{\log d} \sqrt{\log N}$ and where $C = C_K$ may depend on $K$ only. The maximum
is, as usual, over all subsets $T \subseteq [d]$, $|T| \le n$.

The non-uniform prototype of this result, Theorem 5.45, was based on Rudelson's
inequality, Corollary 5.28. In a very similar way, Theorem 5.67 is based on the following
uniform version of Rudelson's inequality.

Proposition 5.68 (Uniform Rudelson's inequality [67]). Let $x_1, \ldots, x_N$ be vectors in
$\mathbb{R}^d$, $1 < N \le d$, and let $K$ be a number such that all $\|x_i\|_\infty \le K$. Let $\varepsilon_1, \ldots, \varepsilon_N$ be
independent symmetric Bernoulli random variables. Then for every $1 < n \le d$ one has

    $\mathbb{E} \max_{|T| \le n} \Big\| \sum_{i=1}^N \varepsilon_i (x_i)_T \otimes (x_i)_T \Big\| \le C l \sqrt{n} \cdot \max_{|T| \le n} \Big\| \sum_{i=1}^N (x_i)_T \otimes (x_i)_T \Big\|^{1/2}$

where $l = \log(n) \sqrt{\log d} \sqrt{\log N}$ and where $C = C_K$ may depend on $K$ only.

The non-uniform Rudelson's inequality (Corollary 5.28) was a consequence of a
non-commutative Khintchine inequality. Unfortunately, there does not seem to exist a way to
deduce Proposition 5.68 from any known result. Instead, this proposition is proved using
Dudley's integral inequality for Gaussian processes and estimates of covering numbers
going back to Carl, see [57]. It is known however that such usage of Dudley's inequality
is not optimal (see e.g. [75]). As a result, the logarithmic factors in Proposition 5.68 are
probably not optimal.

In contrast to these difficulties with Rudelson's inequality, proving uniform versions
of the other two ingredients of Theorem 5.45 - the deviation Lemma 5.47 and
Symmetrization Lemma 5.46 - is straightforward.

Lemma 5.69. Let $(Z_t)_{t \in T}$ be a stochastic process (see the footnote) such that all $Z_t \ge 0$. Then

    $\mathbb{E} \sup_{t \in T} |Z_t^2 - 1| \ge \max\big( \mathbb{E} \sup_{t \in T} |Z_t - 1|, \, (\mathbb{E} \sup_{t \in T} |Z_t - 1|)^2 \big).$

Proof. The argument is entirely parallel to that of Lemma 5.47. □

Lemma 5.70 (Symmetrization for stochastic processes). Let $X_{it}$, $1 \le i \le N$, $t \in T$, be
random vectors valued in some Banach space $B$, where $T$ is a finite index set. Assume
that the random vectors $X_i = (X_{it})_{t \in T}$ (valued in the product space $B^T$) are independent.
Let $\varepsilon_1, \ldots, \varepsilon_N$ be independent symmetric Bernoulli random variables. Then

    $\mathbb{E} \sup_{t \in T} \Big\| \sum_{i=1}^N (X_{it} - \mathbb{E} X_{it}) \Big\| \le 2\, \mathbb{E} \sup_{t \in T} \Big\| \sum_{i=1}^N \varepsilon_i X_{it} \Big\|.$

Proof. The conclusion follows from Lemma 5.46 applied to random vectors $X_i$ valued in
the product Banach space $B^T$ equipped with the norm $|||(Z_t)_{t \in T}||| = \sup_{t \in T} \|Z_t\|$. The
reader should also be able to prove the result directly, following the proof of Lemma 5.46.

□

Proof of Theorem 5.67. Since the random vectors $A_i$ are isotropic in $\mathbb{R}^d$, for every fixed
subset $T \subseteq [d]$ the random vectors $(A_i)_T$ are also isotropic in $\mathbb{R}^T$, so $\mathbb{E} (A_i)_T \otimes (A_i)_T = I_{\mathbb{R}^T}$.
As in the proof of Theorem 5.45, we are going to control

    $E := \mathbb{E} \max_{|T| \le n} \Big\| \frac{1}{N} A_T^* A_T - I_{\mathbb{R}^T} \Big\| = \mathbb{E} \max_{|T| \le n} \Big\| \frac{1}{N} \sum_{i=1}^N (A_i)_T \otimes (A_i)_T - I_{\mathbb{R}^T} \Big\|$
    $\le \frac{2}{N} \mathbb{E} \max_{|T| \le n} \Big\| \sum_{i=1}^N \varepsilon_i (A_i)_T \otimes (A_i)_T \Big\|$

where we used Symmetrization Lemma 5.70 with independent symmetric Bernoulli random
variables $\varepsilon_1, \ldots, \varepsilon_N$. The expectation in the right hand side is taken both with
respect to the random matrix $A$ and the signs $(\varepsilon_i)$. First taking the expectation with
respect to $(\varepsilon_i)$ (conditionally on $A$) and afterwards the expectation with respect to $A$,
we obtain by Proposition 5.68 that

    $E \le \frac{C_K l \sqrt{n}}{N} \mathbb{E} \max_{|T| \le n} \Big\| \sum_{i=1}^N (A_i)_T \otimes (A_i)_T \Big\|^{1/2} = \frac{C_K l \sqrt{n}}{\sqrt{N}} \mathbb{E} \max_{|T| \le n} \Big\| \frac{1}{N} A_T^* A_T \Big\|^{1/2}.$

Footnote: A stochastic process $(Z_t)$ is simply a collection of random variables on a common
probability space indexed by elements $t$ of some abstract set $T$. In our particular application,
$T$ will consist of all subsets $T \subseteq [d]$, $|T| \le n$.

By the triangle inequality, $\mathbb{E} \max_{|T| \le n} \| \frac{1}{N} A_T^* A_T \| \le E + 1$. Hence we obtain

    $E \le C_K l \sqrt{\frac{n}{N}} (E + 1)^{1/2}$

by Hölder's inequality. Solving this inequality in $E$ we conclude that

    $E = \mathbb{E} \max_{|T| \le n} \Big\| \frac{1}{N} A_T^* A_T - I_{\mathbb{R}^T} \Big\| \le \max(\delta, \delta^2)$ where $\delta = C_K l \sqrt{\frac{2n}{N}}$.    (5.58)

The proof is completed by a diagonalization argument similar to Step 2 in the proof
of Theorem 5.45. One uses there a uniform version of the deviation inequality given in
Lemma 5.69 for stochastic processes indexed by the sets $|T| \le n$. We leave the details
to the reader. □
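The step "solving this inequality in $E$" can be spelled out as follows. This is a short sketch of one way to do it, starting from the inequality $E \le C_K\, l \sqrt{n/N}\,(E+1)^{1/2}$ obtained via Hölder's inequality; the abbreviation $a$ is introduced here only for convenience.

```latex
\begin{align*}
&\text{Let } a := C_K\, l\, \sqrt{n/N}, \text{ so that } E \le a\,(E+1)^{1/2}. \\
&\text{Case } E \le 1:\quad E \le a\,(1+1)^{1/2} = a\sqrt{2} = \delta. \\
&\text{Case } E > 1:\quad E \le a\,(2E)^{1/2}
  \;\Longrightarrow\; E^{1/2} \le a\sqrt{2}
  \;\Longrightarrow\; E \le 2a^2 = \delta^2. \\
&\text{In both cases } E \le \max(\delta, \delta^2),
  \qquad \delta = a\sqrt{2} = C_K\, l\, \sqrt{2n/N}.
\end{align*}
```

This matches the form of (5.58), with the factor $\sqrt{2}$ accounting for the $2n$ under the square root.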

Theorem 5.71 (Heavy-tailed restricted isometries). Let $A = (a_{ij})$ be an $m \times n$ matrix
whose rows $A_i$ are independent isotropic random vectors in $\mathbb{R}^n$. Let $K$ be a number such
that all entries $|a_{ij}| \le K$ almost surely. Then the normalized matrix $\bar{A} = \frac{1}{\sqrt{m}} A$ satisfies
the following for $m \le n$, for every sparsity level $1 < k \le n$ and every number $\delta \in (0, 1)$:

    if $m \ge C \delta^{-2} k \log n \log^2(k) \log(\delta^{-2} k \log n \log^2 k)$ then $\mathbb{E}\, \delta_k(\bar{A}) \le \delta$.    (5.59)

Here $C = C_K > 0$ may depend only on $K$.

Proof. The result follows from Theorem 5.67, more precisely from its equivalent
statement (5.58). In our notation, it says that

    $\mathbb{E}\, \delta_k(\bar{A}) \le \max(\delta, \delta^2)$ where $\delta = C_K l \sqrt{\frac{k}{m}} = C_K \sqrt{\frac{k \log m}{m}} \log(k) \sqrt{\log n}$.

The conclusion of the theorem easily follows. □

In the interesting sparsity range $k \ge \log n$ and $k \ge \delta^{-2}$, the condition in Theorem 5.71
clearly reduces to

    $m \ge C \delta^{-2} k \log(n) \log^3 k.$

Remark 5.72 (Boundedness requirement). The boundedness assumption on the entries
of $A$ is essential in Theorem 5.71. Indeed, if the rows of $A$ are independent coordinate
vectors in $\mathbb{R}^n$, then $A$ necessarily has a zero column (in fact at least $n - m$ of them). This clearly
contradicts the restricted isometry property.
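The counting behind this remark is elementary and can be sketched in a few lines. In the snippet below each of the $m$ rows is modeled as $\sqrt{n}\, e_i$ for a uniformly random coordinate $i$, so only the picked indices matter; the dimensions and the seed are arbitrary choices for illustration.

```python
import random

random.seed(0)
m, n = 20, 100
# Row r of A equals sqrt(n) * e_{picked[r]}: a coordinate random vector in R^n.
picked = [random.randrange(n) for _ in range(m)]

# Column t of A is non-zero only if some row picked coordinate t.
nonzero_columns = set(picked)
zero_columns = n - len(nonzero_columns)
print("zero columns:", zero_columns, " (at least n - m =", n - m, ")")
```

If column $t$ is zero, then the 1-sparse vector $x = e_t$ satisfies $Ax = 0$, so the lower bound in (5.51) fails for every $\delta_k < 1$.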

Example 5.73. 1. (Random Fourier measurements): An important example for
Theorem 5.71 is where $A$ realizes random Fourier measurements. Consider the
$n \times n$ Discrete Fourier Transform (DFT) matrix $W$ with entries

    $W_{\omega, t} = \exp\Big( -\frac{2 \pi i \omega t}{n} \Big), \quad \omega, t \in \{0, \ldots, n-1\}.$

Consider a random vector $X$ in $\mathbb{C}^n$ which picks a random row of $W$ (with uniform
distribution). It follows from Parseval's identity that $X$ is isotropic (see the footnote below).
Therefore the $m \times n$ random matrix $A$ whose rows are independent copies of $X$ satisfies the
assumptions of Theorem 5.71 with $K = 1$. Algebraically, we can view $A$ as a
random row sub-matrix of the DFT matrix.

Footnote: For convenience we have developed the theory over $\mathbb{R}$, while this example is over $\mathbb{C}$.
As we noted earlier, all our definitions and results can be carried over to the complex numbers.
So in this example we use the obvious complex versions of the notion of isotropy and of Theorem 5.71.

In compressed sensing, such a matrix $A$ has a remarkable meaning - it realizes $m$
random Fourier measurements of a signal $x \in \mathbb{R}^n$. Indeed, $y = Ax$ is the DFT
of $x$ evaluated at $m$ random points; in words, $y$ consists of $m$ random frequencies
of $x$. Recall that in compressed sensing, we would like to guarantee that with
high probability every sparse signal $x \in \mathbb{R}^n$ (say, $|\mathrm{supp}(x)| \le k$) can be effectively
recovered from its $m$ random frequencies $y = Ax$. Theorem 5.71 together with
Candes-Tao's result (recalled in the beginning of Section 5.6) imply that an exact
recovery is given by the convex optimization problem $\min\{\|x\|_1 : Ax = y\}$ provided
that we observe slightly more frequencies than the sparsity of a signal:
$m \gtrsim C \delta^{-2} k \log(n) \log^3 k$.

2. (Random sub-matrices of orthogonal matrices): In a similar way,
Theorem 5.71 applies to a random row sub-matrix $A$ of an arbitrary bounded orthogonal
matrix $W$. Precisely, $A$ may consist of $m$ randomly chosen rows, uniformly and
without replacement (see the footnote on sampling), from an arbitrary $n \times n$ matrix $W = (w_{ij})$ such that
$W^* W = n I$ and with uniformly bounded coefficients, $\max_{ij} |w_{ij}| = O(1)$. The
examples of such $W$ include the class of Hadamard matrices - orthogonal matrices
in which all entries equal ±1.
5.7 Notes 

For Section 5.1 We work with two kinds of moment assumptions for random matrices:
sub-gaussian and heavy-tailed. These are the two extremes. By the central limit
theorem, the sub-gaussian tail decay is the strongest condition one can demand from
an isotropic distribution. In contrast, our heavy-tailed model is completely general -
no moment assumptions (except the variance) are required. It would be interesting to
analyze random matrices with independent rows or columns in the intermediate regime,
between sub-gaussian and heavy-tailed moment assumptions. We hope that for distributions
with an appropriate finite moment (say, $(2 + \varepsilon)$th or 4th), the results should be the
same as for sub-gaussian distributions, i.e. no $\log n$ factors should occur. In particular,
tall random matrices ($N \gg n$) should still be approximate isometries. This indeed holds
for sub-exponential distributions [5]; see [52] for an attempt to go down to finite moment
assumptions.

For Section 5.2 The material presented here is well known. The volume argument
presented in Lemma 5.2 is quite flexible. It easily generalizes to covering numbers of
more general metric spaces, including convex bodies in Banach spaces. See [60, Lemma
4.16] and other parts of [60] for various methods to control covering numbers.



Footnote (sampling without replacement, Example 5.73): Since in the interesting regime very
few rows are selected, $m \ll n$, sampling with or without replacement are formally equivalent.
For example, see [67] which deals with the model of sampling without replacement.

For Section 5.2.3 The concept of sub-gaussian random variables is due to Kahane [3S].
His definition was based on the moment generating function (Property 4 in Lemma 5.5),
which automatically required sub-gaussian random variables to be centered. We found
it more convenient to use the equivalent Property 3 instead. The characterization of
sub-gaussian random variables in terms of tail decay and moment growth in Lemma 5.5
also goes back to [55].

The rotation invariance of sub-gaussian random variables (Lemma 5.9) is an old
observation [TS]. Its consequence, Proposition 5.10, is a general form of Hoeffding's
inequality, which is usually stated for bounded random variables. For more on large
deviation inequalities, see also notes for Section 5.2.4.

Khintchine inequality is usually stated for the particular case of symmetric Bernoulli
random variables. It can be extended for $0 < p < 2$ using a simple extrapolation
argument based on Hölder's inequality, see ^S} Lemma 4.1].

For Section 5.2.4 Sub-gaussian and sub-exponential random variables can be studied
together in a general framework. For a given exponent $0 < \alpha < \infty$, one defines general
$\psi_\alpha$ random variables, those with moment growth $(\mathbb{E} |X|^p)^{1/p} = O(p^{1/\alpha})$. Sub-gaussian
random variables correspond to $\alpha = 2$ and sub-exponentials to $\alpha = 1$. The reader is
encouraged to extend the results of Sections 5.2.3 and 5.2.4 to this general class.

Proposition 5.16 is a form of Bernstein's inequality, which is usually stated for
bounded random variables in the literature. These forms of Hoeffding's and Bernstein's
inequalities (Propositions 5.10 and 5.16) are partial cases of a large deviation inequality
for general $\psi_\alpha$ norms, which can be found in '^TD, Corollary 2.10] with a similar proof. For
a thorough introduction to large deviation inequalities for sums of independent random
variables (and more), see the books [59, 45, 24] and the tutorial [11].

For Section 5.2.5 Sub-gaussian distributions in $\mathbb{R}^n$ are well studied in geometric
functional analysis; see [53] for a link with compressed sensing. General $\psi_\alpha$ distributions
in $\mathbb{R}^n$ are discussed e.g. in [32].

Isotropic distributions on convex bodies, and more generally isotropic log-concave
distributions, are central to asymptotic convex geometry (see [21, S7]) and computational
geometry [78]. A completely different way in which isotropic distributions appear in
convex geometry is from John's decompositions for contact points of convex bodies, see
[H, 63, 79]. Such distributions are finitely supported and therefore are usually
heavy-tailed.

For an introduction to the concept of frames (Example 5.21), see [4T, 19].

For Section 5.2.6 The non-commutative Khintchine inequality, Theorem 5.26, was
first proved by Lust-Piquard [48] with an unspecified constant $B_p$ in place of $C \sqrt{p}$. The
optimal value of $B_p$ was computed by Buchholz [13, 14]; see [62, Section 6.5] for a
thorough introduction to Buchholz's argument. For the complementary range $1 \le p \le 2$,
a corresponding version of non-commutative Khintchine inequality was obtained by
Lust-Piquard and Pisier [47]. By a duality argument implicitly contained in [47] and
independently observed by Marius Junge, this latter inequality also implies the optimal
order $B_p = O(\sqrt{p})$, see [65] and [HI Section 9.8].

Rudelson's Corollary 5.28 was initially proved using a majorizing measure technique;
our proof follows Pisier's argument from [65] based on the non-commutative Khintchine
inequality.

For Section 5.3 The "Bai-Yin law" (Theorem 5.31) was established for $s_{\max}(A)$ by
Geman [30] and Yin, Bai and Krishnaiah [84]. The part for $s_{\min}(A)$ is due to Silverstein
[70] for Gaussian random matrices. Bai and Yin [5] gave a unified treatment of both
extreme singular values for general distributions. The fourth moment assumption in
Bai-Yin's law is known to be necessary.

Theorem 5.32 and its argument is due to Gordon [35, IMl, ISZ]. Our exposition of this
result and of Corollary 5.35 follows [21].

Proposition 5.34 is just a tip of an iceberg called the concentration of measure
phenomenon. We do not discuss it here because there are many excellent sources, some of
which were mentioned in Section 5.1. Instead we give just one example related to
Corollary 5.35. For a general random matrix $A$ with independent centered entries bounded
by 1, one can use Talagrand's concentration inequality for convex Lipschitz functions on
the cube [73, 74]. Since $s_{\max}(A) = \|A\|$ is a convex function of $A$, Talagrand's
concentration inequality implies $\mathbb{P}\{ |s_{\max}(A) - \mathrm{Median}(s_{\max}(A))| \ge t \} \le 2 e^{-c t^2}$. Although the
precise value of the median may be unknown, integration of this inequality shows that
$|\mathbb{E}\, s_{\max}(A) - \mathrm{Median}(s_{\max}(A))| \le C$.

For the recent developments related to the hard edge problem for almost square and
square matrices (including Theorem 5.38) see the survey [69].

For Section 5.4 Theorem 5.39 on random matrices with sub-gaussian rows, as well as
its proof by a covering argument, is folklore in geometric functional analysis. The use
of covering arguments in a similar context goes back to Milman's proof of Dvoretzky's
theorem [53]; see e.g. [9] and [60, Chapter 4] for an introduction. In the more narrow
context of extreme singular values of random matrices, this type of argument appears
recently e.g. in [2].

The breakthrough work on heavy-tailed isotropic distributions is due to Rudelson [65].
He used Corollary 5.28 in the way we described in the proof of Theorem 5.45 to show
that $\frac{1}{N} A^* A$ is an approximate isometry. Probably Theorem 5.41 can also be deduced
by a modification of this argument; however it is simpler to use the non-commutative
Bernstein's inequality.

The symmetrization technique is well known. For a slightly more general two-sided
inequality than Lemma 5.46, see [45, Lemma 6.3].

The problem of estimating covariance matrices described in Section 15.4.31 is a basic 
problem in statistics, see e.g. [35]. However, most work in the statistical literature is 
focused on the normal distribution or general product distributions (up to linear trans- 
formations), which corresponds to studying random matrices with independent entries. 
For non-product distributions, an interesting example is for uniform distributions on con- 
vex sets [40, . As we mentioned in Example 15.251 such distributions are sub-exponential 
but not necessarily sub-gaussian, so Corollary 15.501 does not apply. Still, the sample 
size N = 0{n) suffices to estimate the covariance matrix in this case [2]. It is conjec- 
tured that the same should hold for general distributions with finite (e. g. 4th) moment 
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assumption [55^. 

Corollary 5.55 on random sub-matrices is a variant of Rudelson's result from
[64]. The study of random sub-matrices was continued in [66]. Random sub-frames were
studied in [80], where a variant of Corollary 5.56 was proved.

For Section 5.5. Theorem 5.58 for sub-gaussian columns seems to be new. However,
historically the efforts of geometric functional analysts were immediately focused on 
the more difficult case of sub-exponential tail decay (given by uniform distributions on 
convex bodies). An indication of how to prove results like Theorem 5.58 by decoupling and
covering is present in [12] and is followed in [35, 2].

The normalization condition $\|A_j\|_2 = \sqrt{N}$ in Theorem 5.58 cannot be dropped but
can be relaxed. Namely, consider the random variable
$\delta := \max_{j \le n} \left| \frac{\|A_j\|_2^2}{N} - 1 \right|$. Then
the conclusion of Theorem 5.58 holds with (5.36) replaced by

$$(1 - \delta)\sqrt{N} - C\sqrt{n} - t \le s_{\min}(A) \le s_{\max}(A) \le (1 + \delta)\sqrt{N} + C\sqrt{n} + t.$$
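
As a rough numerical illustration of the relaxed normalization, the following sketch draws a Gaussian matrix whose columns are not normalized, measures how far the squared column norms deviate from N, and compares the extreme singular values with the relaxed two-sided bound. Several things here are assumptions made only for the illustration: delta is taken to be the largest relative deviation of the squared column norms from N, the values C = 1 and t = 3 are arbitrary stand-ins for the unspecified constants, and the function names are hypothetical. The singular values are estimated by plain power iteration, so this is a sanity check, not a proof.

```python
import math
import random


def gram_apply(A, v):
    """Return (A^T A) v without forming A^T A explicitly."""
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    return [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(len(v))]


def top_eigenvalue(apply_fn, n, rng, iters=300):
    """Power iteration; returns a Rayleigh-quotient estimate of the top eigenvalue."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        w = apply_fn(v)
        lam = sum(wi * vi for wi, vi in zip(w, v))  # v has unit norm here
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:
            break
        v = [wi / norm for wi in w]
    return lam


def check_relaxed_bound(N=200, n=10, C=1.0, t=3.0, seed=0):
    """Compare s_min, s_max of an N x n Gaussian matrix with the relaxed bound.

    C and t are illustrative choices, not the constants of the theorem.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]

    # Assumed definition: largest relative deviation of squared column norms from N.
    col_sq = [sum(A[i][j] ** 2 for i in range(N)) for j in range(n)]
    delta = max(abs(c / N - 1.0) for c in col_sq)

    # Largest eigenvalue of A^T A, then the smallest via a shifted power iteration.
    lam_max = top_eigenvalue(lambda v: gram_apply(A, v), n, rng)
    shift = 1.05 * lam_max

    def shifted(v):
        g = gram_apply(A, v)
        return [shift * x - y for x, y in zip(v, g)]

    lam_min = shift - top_eigenvalue(shifted, n, rng)

    return {
        "delta": delta,
        "s_min": math.sqrt(max(lam_min, 0.0)),
        "s_max": math.sqrt(lam_max),
        "lower": (1.0 - delta) * math.sqrt(N) - C * math.sqrt(n) - t,
        "upper": (1.0 + delta) * math.sqrt(N) + C * math.sqrt(n) + t,
    }


if __name__ == "__main__":
    for key, value in check_relaxed_bound().items():
        print(f"{key}: {value:.3f}")
```

Because a Rayleigh quotient never exceeds the largest eigenvalue and never falls below the smallest, the estimates err on the side of underestimating s_max and overestimating s_min; the sketch is therefore only indicative of how much slack the bound has for this particular draw.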

Theorem 5.62 for heavy-tailed columns also seems to be new. The incoherence
parameter m is meant to prevent collisions of the columns of A in a quantitative way. It is
not clear whether the logarithmic factor is needed in the conclusion of Theorem 5.62 or
whether the incoherence parameter alone takes care of the logarithmic factors whenever
they appear. The same question can be raised for all other results for heavy-tailed
matrices in Section 5.4.2 and their applications: can we replace the logarithmic factors by
more sensitive quantities (e.g. the logarithm of the incoherence parameter)?

For Section 5.6. For a mathematical introduction to compressed sensing, see the
introductory chapter of this book and [28, 20].

A version of Theorem 5.65 was proved in [54] for the row-independent model; an
extension from sub-gaussian to sub-exponential distributions is given in [3]. A general
framework of stochastic processes with sub-exponential tails is discussed in [52]. For the
column-independent model, Theorem 5.65 seems to be new.

Proposition 5.66, which formalizes a simple approach to the restricted isometry property
based on concentration, is taken from [10]. Like Theorem 5.65, it can also be used to
show that Gaussian and Bernoulli random matrices are restricted isometries. Indeed, it
is not difficult to check that these matrices satisfy a concentration inequality as required
in Proposition 5.66 [1].
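
The concentration property mentioned above can be probed empirically. The sketch below, a minimal illustration and not the argument of the cited papers, draws an m x n matrix A with independent Gaussian or symmetric Bernoulli entries scaled by 1/sqrt(m), applies it to a random k-sparse vector x, and records how far the ratio of squared norms ||Ax||^2 / ||x||^2 falls from 1. The function names, the parameter values, and the choice of a Gaussian-valued sparse test vector are all assumptions made for the example. Checking one vector at a time illustrates only the concentration hypothesis; the restricted isometry property itself is a uniform statement over all sparse vectors.

```python
import math
import random


def norm_ratio(m, n, k, kind, rng):
    """Return ||A x||_2^2 / ||x||_2^2 for a random m x n matrix A and k-sparse x.

    kind is "gaussian" (N(0,1) entries) or "bernoulli" (+1/-1 entries);
    in both cases A is scaled by 1/sqrt(m).
    """
    # Random k-sparse test vector, stored as {coordinate: value}.
    support = rng.sample(range(n), k)
    x = {j: rng.gauss(0.0, 1.0) for j in support}
    x_norm_sq = sum(v * v for v in x.values())

    total = 0.0
    for _ in range(m):
        # Only the entries of this row in the support of x affect A x,
        # so the remaining n - k entries are never generated.
        if kind == "gaussian":
            row_dot = sum(rng.gauss(0.0, 1.0) * v for v in x.values())
        elif kind == "bernoulli":
            row_dot = sum(rng.choice((-1.0, 1.0)) * v for v in x.values())
        else:
            raise ValueError("kind must be 'gaussian' or 'bernoulli'")
        total += row_dot * row_dot / m
    return total / x_norm_sq


def worst_deviation(m, n, k, kind, trials=20, seed=0):
    """Largest |ratio - 1| observed over independent draws of (A, x)."""
    rng = random.Random(seed)
    return max(abs(norm_ratio(m, n, k, kind, rng) - 1.0) for _ in range(trials))


if __name__ == "__main__":
    for kind in ("gaussian", "bernoulli"):
        dev = worst_deviation(m=400, n=1000, k=10, kind=kind)
        print(f"{kind}: worst deviation over 20 trials = {dev:.3f}")
```

For both ensembles the ratio has mean 1 and standard deviation of order sqrt(2/m), so increasing m should visibly shrink the observed deviations.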

Section 5.6.2 on heavy-tailed restricted isometries is an exposition of the results
from [67]. Using concentration of measure techniques, one can prove a version of
Theorem 5.71 with high probability $1 - n^{-c \log^3 k}$ rather than in expectation [52]. Earlier,
Candes and Tao [18] proved a similar result for random Fourier matrices, although with
a slightly higher exponent in the logarithm for the number of measurements in (5.55),
$m = O(k \log^6 n)$. The survey [62] offers a thorough exposition of the material presented
in Section 5.6.2 and more.